We consider a circular deconvolution problem, where the density $f$ of a circular
random variable $X$ has to be estimated nonparametrically based on an
iid. sample from a noisy observation $Y$ of $X$. The additive measurement error
is supposed to be independent of $X$. The objective of this paper is the
construction of a fully data-driven estimation procedure when the error density $\varphi$
is unknown. However, we suppose that in addition to the iid. sample from $Y$,
we have at our disposal an additional iid. sample independently drawn from the
error distribution.

First, we develop a minimax theory in terms of both sample sizes. We pro- 
pose an orthogonal series estimator attaining the minimax rates but requiring 
an optimal choice of a dimension parameter depending on certain characteristics 
of $f$ and $\varphi$, which are not known in practice. The main issue addressed in our
work is the adaptive choice of this dimension parameter using a model selection 
approach. In a first step, we develop a penalized minimum contrast estimator 
supposing the degree of ill-posedness of the underlying inverse problem to be 
known, which amounts to assuming partial knowledge of the error distribution. 
We show that this data-driven estimator can attain the lower risk bound up to a 
constant in both sample sizes n and m over a wide range of density classes cover- 
ing in particular ordinary and super smooth densities. Finally, by randomizing 
the penalty and the collection of models, we modify the estimator such that it 
does not require any prior knowledge of the error distribution anymore. Even 
when dispensing with any hypotheses on $\varphi$, this fully data-driven estimator still
preserves minimax optimality in almost the same cases as the partially adaptive 
estimator.

1. Introduction



This work deals with the estimation of the density of a circular random variable from noisy 
observations. Such data occur in various fields of natural science, as for example in geology 
and biology, to mention but two. Curray (1956) discusses the analysis of directional data 
in the context of geological research, where it is often useful to measure and analyze the 
orientations of various features. More recently, Cochran et al. (2004) investigated migrating 
songbirds' navigation abilities. They fitted birds with radio transmitters and placed them 
in outdoor cages in an artificially turned magnetic field. The observations consisted of 
the directions the birds departed in when released. Such directional observations can be 
represented as points on a compass rose and hence on the circle. For a more general and 
detailed discussion of the particularities of circular data we refer to Mardia (1972) and 
Fisher (1993). 

Let $X$ be the circular random variable whose density $f$ we are interested in and $\varepsilon$ an
independent additive circular error with unknown density $\varphi$. Denote by $Y$ the contaminated
observation and by $g$ its density. Throughout this work we identify the circle with
the unit interval $[0, 1)$, for notational convenience. Thus, $X$ and $\varepsilon$ take their values in $[0, 1)$.
Let $\lfloor\cdot\rfloor$ be the floor function. Taking into account the circular nature of the data, the model
can be written as $Y = X + \varepsilon - \lfloor X + \varepsilon\rfloor$ or equivalently $Y = X + \varepsilon \mod [0, 1)$. Then, we
have

$$g(y) = (f * \varphi)(y) := \int_{[0,1)} f\big((y - s) - \lfloor y - s\rfloor\big)\, \varphi(s)\, ds, \qquad y \in [0, 1),$$

such that $*$ denotes circular convolution. Therefore, the estimation of $f$ is called a circular
deconvolution problem. Let $L^2 := L^2([0, 1))$ be the Hilbert space of square integrable
complex-valued functions defined on $[0, 1)$ endowed with the usual inner product
$\langle f, g\rangle = \int_{[0,1)} f(x)\overline{g(x)}\,dx$, where $\overline{g(x)}$ denotes the complex conjugate of $g(x)$. In this work we suppose
that $f$ and $\varphi$, and hence $g$, belong to the subset $\mathcal D$ of all densities in $L^2$. As a consequence,
they admit representations as discrete Fourier series with respect to the exponential basis
$\{e_j\}_{j\in\mathbb Z}$ of $L^2$, where $e_j(x) := \exp(-i2\pi jx)$ for $x \in [0, 1)$ and $j \in \mathbb Z$. Given $p \in \mathcal D$ and $j \in \mathbb Z$
let $[p]_j := \langle p, e_j\rangle$ be the $j$-th Fourier coefficient of $p$. In particular, $[p]_0 = 1$. The key to the
analysis of the circular deconvolution problem is the convolution theorem, which states that
$g = f * \varphi$ if and only if $[g]_j = [f]_j [\varphi]_j$ for all $j \in \mathbb Z$. Therefore, as long as $[\varphi]_j \neq 0$ for all
$j \in \mathbb Z$, which is assumed from now on, we have

$$f = 1 + \sum_{|j|>0} \frac{[g]_j}{[\varphi]_j}\, e_j \quad\text{with}\quad [g]_j = \mathbb E\, e_j(-Y) \ \text{ and } \ [\varphi]_j = \mathbb E\, e_j(-\varepsilon), \quad \forall\, j \in \mathbb Z. \qquad (1.1)$$
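The convolution theorem rests on the pointwise identity $e_j(-Y) = e_j(-X)\,e_j(-\varepsilon)$ for integer $j$, combined with the independence of $X$ and $\varepsilon$. The following is a minimal Python sketch of this fact, not code from the paper: the helper names and the two wrapped normal distributions used in the simulation are illustrative assumptions.

```python
import cmath
import math
import random

def e(j, x):
    """Exponential basis function e_j(x) = exp(-i 2 pi j x)."""
    return cmath.exp(complex(0.0, -2.0 * math.pi * j * x))

def empirical_coefficient(j, sample):
    """Empirical Fourier coefficient: the sample mean of e_j(-Z_k)."""
    return sum(e(j, -z) for z in sample) / len(sample)

def wrap(x):
    """Map a real number onto the circle, identified with [0, 1)."""
    return x % 1.0

# Pointwise identity behind the convolution theorem (exact up to rounding).
x, eps, j = 0.7, 0.6, 3
pointwise_gap = abs(e(j, -wrap(x + eps)) - e(j, -x) * e(j, -eps))

# Monte Carlo illustration with hypothetical distributions (an assumption).
rng = random.Random(0)
n = 20000
xs = [wrap(rng.gauss(0.5, 0.1)) for _ in range(n)]
es = [wrap(rng.gauss(0.0, 0.05)) for _ in range(n)]
ys = [wrap(a + b) for a, b in zip(xs, es)]
gap = abs(empirical_coefficient(1, ys)
          - empirical_coefficient(1, xs) * empirical_coefficient(1, es))
```

In this sketch `pointwise_gap` is zero up to floating point error, while `gap` is small but nonzero because the product rule for the coefficients holds only in expectation.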

Note that an analogous representation holds in the case of deconvolution on the real line
when the $X$-density is compactly supported, but the error term $\varepsilon$, and hence $Y$, take their
values in $\mathbb R$. In this situation, the deconvolution density still admits a discrete
representation as in (1.1), but involving the characteristic functions of $\varphi$ and $g$ rather than their
discrete Fourier coefficients. There is a vast literature on deconvolution on the real line,
with or without compactly supported deconvolution density. In the case where the error density
is fully known, a very popular approach based on kernel methods has been considered by
Carroll and Hall (1988), Devroye (1989), Fan (1991, 1992), Stefanski (1990), Zhang (1990),
Goldenshluger (1999, 2000) and Kim and Koo (2002), to name but a few. Mendelsohn
and Rice (1982) and Koo and Park (1996), for example, have studied spline-based methods,
while a wavelet decomposition has been used by Pensky and Vidakovic (1999), Fan and
Koo (2002) and Bigot and Van Bellegem (2009), for instance. Situations with only partial
knowledge about the error density have also been considered (cf. Butucea and Matias
(2005), Meister (2004, 2006), or Schwarz and Van Bellegem (2009)). Consistent
deconvolution without prior knowledge of the error distribution is also possible in the case of panel
data (cf. Horowitz and Markatou (1996), Hall and Yao (2003) or Neumann (2007)) or by
assuming an additional sample from the error distribution (cf. Diggle and Hall (1993),
Neumann (1997), Johannes (2009) or Comte and Lacour (2009)). For a broader overview
of deconvolution problems the reader may refer to the recent monograph by Meister (2009).

Let us return to the circular case. In this paper we suppose that we do not know the error
density $\varphi$, but that we have at our disposal, in addition to the iid. sample $(Y_k)_{k=1}^n$ of size
$n \in \mathbb N$ from $g$, an independent iid. sample $(\varepsilon_k)_{k=1}^m$ of size $m \in \mathbb N$ from $\varphi$. Our purpose is
to establish a fully data-driven estimation procedure for the deconvolution density $f$ which
attains optimal convergence rates in a minimax sense. More precisely, given classes $\mathcal F_\gamma^r$ and
$\mathcal E_\lambda^d$ (defined below) of deconvolution and error densities, respectively, we shall measure the
accuracy of an estimator $\tilde f$ of $f$ by the maximal weighted risk
$\sup_{f\in\mathcal F_\gamma^r} \sup_{\varphi\in\mathcal E_\lambda^d} \mathbb E\|\tilde f - f\|_\omega^2$,
defined with respect to some weighted norm $\|\cdot\|_\omega^2 := \sum_{j\in\mathbb Z} \omega_j |[\cdot]_j|^2$, where $\omega := (\omega_j)_{j\in\mathbb Z}$ is
a strictly positive sequence of weights. This allows us to quantify the estimation accuracy
in terms of the mean integrated square error (MISE) not only of $f$ itself, but also of
its derivatives, for example. It is well known that even in the case of a known error density
the maximal risk in terms of the MISE in the circular deconvolution problem is essentially
determined by the asymptotic behavior of the sequences of Fourier coefficients $([f]_j)_{j\in\mathbb Z}$ and
$([\varphi]_j)_{j\in\mathbb Z}$ of the deconvolution density and the error density, respectively. For a fixed
deconvolution density $f$, a faster decay of the $\varepsilon$-density's Fourier coefficients $([\varphi]_j)_{j\in\mathbb Z}$ results in
a slower optimal rate of convergence. In the standard context of an ordinary smooth
deconvolution density, for example, i.e. when $([f]_j)_{j\in\mathbb Z}$ decays polynomially, logarithmic rates
of convergence appear when the error density is super smooth, i.e., $([\varphi]_j)_{j\in\mathbb Z}$ has an
exponential decay. This special case is treated in Efromovich (1997), for example. However, this
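situation and many others are covered by the density classes

$$\mathcal F_\gamma^r := \Big\{ p \in \mathcal D : \sum_{j\in\mathbb Z} \gamma_j |[p]_j|^2 =: \|p\|_\gamma^2 \le r \Big\} \quad\text{and}\quad \mathcal E_\lambda^d := \Big\{ p \in \mathcal D : \frac{1}{d} \le \frac{|[p]_j|^2}{\lambda_j} \le d \ \ \forall\, j \in \mathbb Z \Big\},$$

where $r, d \ge 1$ and the positive weight sequences $\gamma := (\gamma_j)_{j\in\mathbb Z}$ and $\lambda := (\lambda_j)_{j\in\mathbb Z}$ specify the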

asymptotic behavior of the respective sequence of Fourier coefficients. In Section 2 we show
a lower bound of the maximal weighted risk which is essentially determined by the sequences
$\gamma$, $\lambda$ and $\omega$. This lower bound is composed of two main terms, each of them depending on
the size of one sample, but not on the other. Let us define an orthogonal series estimator
by replacing the unknown Fourier coefficients in (1.1) by empirical counterparts, that is,




$$\hat f_k := 1 + \sum_{0<|j|\le k} \frac{[\hat g]_j}{[\hat\varphi]_j}\, \mathbb 1\{|[\hat\varphi]_j|^2 \ge 1/m\}\, e_j \quad\text{with}\quad [\hat g]_j := \frac{1}{n} \sum_{k=1}^{n} e_j(-Y_k) \ \text{ and } \ [\hat\varphi]_j := \frac{1}{m} \sum_{k=1}^{m} e_j(-\varepsilon_k). \qquad (1.2)$$

Again, things work out similarly in deconvolution on the real line, where one only has
to replace the empirical Fourier coefficients by the corresponding values of the empirical
characteristic functions. Similar estimators have already been studied by Neumann (1997)
on the real line and by Efromovich (1997) in the circular case, for example. We show below
that the estimator attains the lower bound and is hence minimax optimal. By comparing
the minimax rates in the cases of known and unknown error density, we can characterize
the influence of the estimation of the error density on the quality of the estimation. In
particular, depending on the $Y$-sample size $n$, we can determine the minimal $\varepsilon$-sample size
$m_n$ needed to attain the same upper risk bound as in the case of a known error density,
up to a constant. Interestingly, the required sample size $m_n$ is far smaller than $n$ in a wide
range of situations. For example, in the super smooth case, it is sufficient that the size of
the $\varepsilon$-sample is a polynomial in $n$, i.e. $m_n = n^r$ for any $r > 0$.
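The orthogonal series estimator (1.2) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, the samples are plain lists of points in $[0, 1)$, and the dimension parameter `k` is supplied by the caller.

```python
import cmath
import math

def e(j, x):
    """Exponential basis function e_j(x) = exp(-i 2 pi j x)."""
    return cmath.exp(complex(0.0, -2.0 * math.pi * j * x))

def empirical_coefficient(j, sample):
    """Sample mean of e_j(-Z_k), the empirical counterpart of [p]_j."""
    return sum(e(j, -z) for z in sample) / len(sample)

def series_estimator(y_sample, eps_sample, k):
    """Return the function x -> f_hat_k(x) built as in (1.2)."""
    m = len(eps_sample)
    coeffs = {}
    for j in range(-k, k + 1):
        if j == 0:
            continue  # the zeroth coefficient of a density equals 1
        phi_j = empirical_coefficient(j, eps_sample)
        # Keep frequency j only if |phi_hat_j|^2 >= 1/m (threshold in (1.2)).
        if abs(phi_j) ** 2 >= 1.0 / m:
            coeffs[j] = empirical_coefficient(j, y_sample) / phi_j

    def f_hat(x):
        value = 1.0 + sum(c * e(j, x) for j, c in coeffs.items())
        # Contributions of j and -j are complex conjugates, so the sum is real.
        return value.real

    return f_hat
```

For instance, `series_estimator(ys, es, 5)(0.25)` evaluates the estimate at the point 0.25 for samples `ys` and `es`. The indicator guards against dividing by an empirical coefficient of the error density that is too close to zero to be estimated reliably from $m$ observations.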

Of course, minimax optimality is only achieved as long as the dimension parameter $k$ is
chosen in an optimal way. In general, this optimal choice of $k$ depends, among other things, on
the sequences $\gamma$ and $\lambda$. However, in the special case where the error density is known to
be super smooth and the deconvolution density is ordinary smooth, the optimal dimension
parameter depends only on $\lambda$ but not on $\gamma$. Hence, the estimator is automatically adaptive
with respect to $\gamma$ under the optimal choice of $k$. In this situation Efromovich (1997) provides
an estimator which is also adaptive with respect to the super smooth error density. In
contrast, Cavalier and Hengartner (2005), deriving oracle inequalities in an indirect
regression problem based on a circular convolution contaminated by Gaussian white noise,
treat the ordinary smooth case only. As in our setting, their observation scheme involves
two independent samples. It is worth noting that in order to apply these estimators, one
has to know in advance at least whether the error density is ordinary or super smooth. We provide
in this work a unified estimation procedure which can attain minimax rates in either of the
two cases, that is, which is adaptive over a class including both ordinary and super smooth
error densities. This fully adaptive method to choose the parameter $k$ depends only on
the observations and on characteristics of neither $f$ nor $\varphi$. The central result of the
present paper states that for this automatic choice $\hat k$, the estimator $\hat f_{\hat k}$ attains the lower
bound up to a constant, and is thus minimax optimal, over a wide range of sequences $\gamma$
and $\lambda$, covering in particular both ordinary and super smooth error densities.
As far as the two sample sizes are concerned, the assumption made by Cavalier and
Hengartner (2005) on the respective noise levels can be translated to our model by stating that
the $\varepsilon$-sample size $m$ is at least as large as the $Y$-sample size $n$. This assumption is also
used by Efromovich (1997). However, as mentioned above, without changing the minimax
rates, the $\varepsilon$-sample size can be reduced to $m_n$, which can be far smaller than $n$. This is a
desirable property, as the observation of the additional sample from $\varepsilon$ may be expensive in
practice. Nevertheless, the minimal choice of $m$ depends, among other things, on the sequences $\gamma$
and $\lambda$ and is hence unknown in general. Although the minimax rate may eventually
deteriorate when the sample size $m$ is chosen smaller than $n$, the proposed estimator still attains
this rate in many cases, that is, no price in terms of convergence rate has to be paid for
adaptivity. Surprisingly, even in the cases where the optimal rate is not attained anymore,
the deterioration is only of logarithmic order as long as the error density is either ordinary
or super smooth.

The adaptive choice of $k$ is motivated by the general model selection strategy developed
in Barron et al. (1999). Concretely, following Comte and Taupin (2003), who treat the case
of a known error density only, $\hat k$ is the minimizer of a penalized contrast

$$\hat k := \operatorname*{argmin}_{1\le k\le K} \big\{ -\|\hat f_k\|_\omega^2 + \operatorname{pen}(k) \big\}.$$

Note that we can compute $\|\hat f_k\|_\omega^2 = 1 + \sum_{0<|j|\le k} \omega_j |[\hat g]_j|^2 |[\hat\varphi]_j|^{-2}\, \mathbb 1\{|[\hat\varphi]_j|^2 \ge 1/m\}$. As in the case
of a known error density, it turns out that the penalty function $\operatorname{pen}(\cdot)$ as well as the upper
bound $K$ needed for the right choice of $k$ depend on a characteristic of the error density
which is now unknown. This quantity is often referred to as the degree of ill-posedness of the
underlying inverse problem. Therefore, as an intermediate step, assuming this parameter
to be known, we show an upper risk bound for this partially adaptive estimator $\hat f_{\hat k}$. We
prove that over a wide range of sequences $\gamma$ and $\lambda$, the adaptive choice of $k$ yields the same
upper risk bound as the optimal choice, up to a constant. Finally, we drop the requirement
that the degree of ill-posedness is known. In order to choose $k$ adaptively even in this case,
we replace $\operatorname{pen}(\cdot)$ and $K$ by estimates depending only on the data. As in the case of known
degree of ill-posedness, we show an upper risk bound for the now fully adaptive estimator.
It is noteworthy that even though the proofs are more intricate in this case, the result
strongly resembles its analogue in the case of known degree of ill-posedness.
Let us return briefly to deconvolution on the real line with compactly supported X-density. 
We note that in this situation the adaptive choice of k can be performed in the same way. 
Moreover, the upper risk bounds remain valid, and the adaptive estimator is minimax opti- 
mal over a wide range of cases. In fact, the circular structure of the model is only exploited 
in the proof of the lower bound and in order to guarantee the existence of the discrete 
representation in (1.1), which still holds in case of a compactly supported deconvolution 
density. 

This article is organized as follows. In the next section, we develop the minimax theory 
for the circular deconvolution model with respect to the weighted norms introduced above 
and we derive the optimal convergence rates in the ordinary and in the super smooth case. 
Section 3 is devoted to the construction of the adaptive estimator in the case of known degree 
of ill-posedness. An upper risk bound is shown and convergence rates for the ordinary and 
super smooth case are compared to the minimax optimal ones. The last section provides 
the fully adaptive generalization of this method. All proofs are deferred to the appendix. 

2. Minimax optimal estimation 

In this section we develop the minimax theory for the estimation of a circular deconvolution 
density under unknown error density when two independent samples from Y and e are 
available. A lower bound depending on both sample sizes is derived and it is shown that 
the orthogonal series estimator defined in (1.2) attains this lower bound up to a constant. 
All results in this paper are derived under the following minimal regularity conditions. 

Assumption 2.1 Let $\gamma := (\gamma_j)_{j\in\mathbb Z}$, $\omega := (\omega_j)_{j\in\mathbb Z}$ and $\lambda := (\lambda_j)_{j\in\mathbb Z}$ be strictly positive
symmetric sequences of weights with $\gamma_0 = \omega_0 = \lambda_0 = 1$ such that $(\omega_n/\gamma_n)_{n\in\mathbb N}$ and $(\lambda_n)_{n\in\mathbb N}$
are non-increasing, respectively.

Note that $\lambda_j$ is even a null sequence as $|j|$ tends to infinity, since we suppose $\varphi$ to be a
density in $L^2$. The assumption that $\omega/\gamma$ is non-increasing ensures that the weighted risk is
well defined.

Lower bounds The next assertion provides a lower bound in the case of a known error density,
which obviously will depend on the size of the $Y$-sample only. Of course, this lower bound
is still valid in the case of an unknown error density.

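Theorem 2.2 Suppose an iid. $Y$-sample of size $n$ and that the error density $\varphi$ is known.
Consider sequences $\omega$, $\gamma$, and $\lambda$ satisfying Assumption 2.1 such that $\sum_{j\in\mathbb Z} \gamma_j^{-1} = \Gamma < \infty$
and such that $\varphi \in \mathcal E_\lambda^d$ for some $d \ge 1$. Define for all $n \ge 1$

$$\psi_n := \psi_n(\gamma, \lambda, \omega) := \min_{k\in\mathbb N} \Big\{ \max\Big( \frac{\omega_k}{\gamma_k},\ \sum_{0<|j|\le k} \frac{\omega_j}{n\lambda_j} \Big) \Big\}$$

and

$$k_n^* := k_n^*(\gamma, \lambda, \omega) := \operatorname*{argmin}_{k\in\mathbb N} \Big\{ \max\Big( \frac{\omega_k}{\gamma_k},\ \sum_{0<|j|\le k} \frac{\omega_j}{n\lambda_j} \Big) \Big\}. \qquad (2.1)$$

If in addition $\eta := \inf_{n\ge1} \big\{ \psi_n^{-1} \min\big( \omega_{k_n^*} \gamma_{k_n^*}^{-1},\ \sum_{0<|j|\le k_n^*} \omega_j (n\lambda_j)^{-1} \big) \big\} > 0$, then for all $n \ge 2$
and for any estimator $\tilde f$ of $f$ we have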



^{n\J-nl}> ^ mm(r - l R ll{m) ^ 



The proof of the last assertion is based on Assouad's cube technique (cf. Korostelev and
Tsybakov (1993)), where we construct $2^{2k_n^*}$ candidates of deconvolution densities which
have the largest possible $\|\cdot\|_\omega$-distance but are still statistically indistinguishable. It is
worth noting that the additional assumption $\sum_{j\in\mathbb Z} \gamma_j^{-1} = \Gamma < \infty$ is only used to ensure
that these candidates are densities. Observe further that in the case $r = 1$, the lower bound
is equal to zero, because in this situation the set $\mathcal F_\gamma^r$ reduces to a singleton containing the
uniform density. In the next theorem we state a lower bound characterizing the additional
complexity due to the unknown error density, which surprisingly depends only on the error
sample size.

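Theorem 2.3 Suppose independent iid. samples from $Y$ and $\varepsilon$ of size $n$ and $m$, respectively.
Consider sequences $\omega$, $\gamma$, and $\lambda$ satisfying Assumption 2.1. For all $m \ge 2$, let

$$\kappa_m := \kappa_m(\gamma, \lambda, \omega) := \max_{j\in\mathbb N} \Big\{ \omega_j \gamma_j^{-1} \min\Big( 1, \frac{1}{m\lambda_j} \Big) \Big\}. \qquad (2.2)$$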

If in addition there exists a density in S"/^ which is bounded from below by 1/2, then, for 
all m ^ 2 and for any estimator f of f we have 

sup su P {e||/-/|| 2 }. 

The proof of the last assertion takes its inspiration from a proof given in Neumann (1997).
In contrast to the proof of Theorem 2.2 we only have to compare two candidates of error
densities which are still statistically indistinguishable. However, to ensure that these
candidates are densities, we impose the additional condition. It is easily seen that this
condition is satisfied if A := X^'ez \' < 00 an( ^ ^ max(4A 2 , 1). It is worth to note 
that in the case $d = 1$, the set $\mathcal E_\lambda^d$ of possible error densities reduces to a singleton, and hence
the lower bound is equal to zero. Finally, by combining both lower bounds we obtain
the next corollary.



min(r - 1, 1) min(l/(4d), (1 - tT 1 / 4 ) 2 ) 

r~ for 
Corollary 2.4 Under the assumptions of Theorems 2.2 and 2.3 we have for any estimator $\tilde f$
of $f$ and for all $n, m \ge 2$ that

r ll7 „ ll2 -| rimm(r-l, (MF)- 1 ) mmid' 1 / 2 , 4(1 - d 1 / 4 ) 2 ) r , 
sup sup E||/ - /II 2 > i ^ l± — vra — + n m }. 



Upper bound The next theorem summarizes sufficient conditions to ensure the optimality
of the orthogonal series estimator defined in (1.2) provided the dimension parameter $k$
is chosen appropriately. To be more precise, we use the value $k_n^*$ defined in (2.1), which
obviously depends on the sequences $\omega$, $\gamma$ and $\lambda$ but surprisingly not on the $\varepsilon$-sample size $m$.
However, under this choice the estimator attains the lower bound given in Corollary 2.4 up
to a constant and hence it is minimax optimal.

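Theorem 2.5 Suppose independent iid. samples from $Y$ and $\varepsilon$ of size $n$ and $m$, respectively.
Consider sequences $\omega$, $\gamma$ and $\lambda$ satisfying Assumption 2.1. Let $\hat f_{k_n^*}$ be the estimator given
in (1.2) with $k_n^*$ defined in (2.1). Then, there exists a numerical constant $C > 0$ such that
for all $n, m \ge 1$ we have

$$\sup_{f\in\mathcal F_\gamma^r} \sup_{\varphi\in\mathcal E_\lambda^d} \big\{ \mathbb E\|\hat f_{k_n^*} - f\|_\omega^2 \big\} \le C \big\{ (d + r)\,\psi_n + d\,r\,\kappa_m \big\}.$$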

Note that under slightly stronger conditions on the sequences $\omega$, $\gamma$ and $\lambda$ than
Assumption 2.1 it can be shown that in the case of equally large samples from $Y$ and $\varepsilon$ we always obtain
the same rate as in the case of a known error density. However, below we show that in special cases
the required $\varepsilon$-sample size can be much smaller than the $Y$-sample size.
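The quantities $\psi_n$ and $k_n^*$ from (2.1) and $\kappa_m$ from (2.2) are easy to evaluate numerically once the weight sequences are fixed. The sketch below is illustrative only: the function names are hypothetical, the sequences are passed as Python callables on the positive integers, and the search ranges `k_max` and `j_max` are assumed cut-offs replacing the minimum and maximum over all of $\mathbb N$.

```python
def psi_and_k_star(n, omega, gamma, lam, k_max):
    """Brute-force evaluation of (2.1) over k = 1, ..., k_max.

    Returns the pair (psi_n, k_n^*) restricted to the search range.
    """
    best_k, best_val = None, float("inf")
    variance = 0.0
    for k in range(1, k_max + 1):
        # Symmetric sequences: the terms for j = k and j = -k coincide.
        variance += 2.0 * omega(k) / (n * lam(k))
        bias = omega(k) / gamma(k)
        val = max(bias, variance)
        if val < best_val:
            best_k, best_val = k, val
    return best_val, best_k

def kappa(m, omega, gamma, lam, j_max):
    """Brute-force evaluation of (2.2) over j = 1, ..., j_max."""
    return max(omega(j) / gamma(j) * min(1.0, 1.0 / (m * lam(j)))
               for j in range(1, j_max + 1))

# Toy example: omega_j = 1, gamma_j = j^2, lambda_j = 1, n = 16.
psi_16, k_16 = psi_and_k_star(16, lambda j: 1.0, lambda j: float(j) ** 2,
                              lambda j: 1.0, 50)
```

Under Assumption 2.1 the first argument of the maximum in (2.1) is non-increasing in $k$ and the second is increasing, so the truncated search returns the true minimizer as soon as `k_max` exceeds the point where the two terms cross. In the toy example the two terms meet at $k = 2$, where both equal $1/4$.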



2.1. Illustration: estimation of derivatives. 

To illustrate the previous results we assume in the following that the deconvolution density $f$
is an element of the Sobolev space of periodic functions $W_p$, $p \in \mathbb N$, given by

$$W_p = \big\{ f \in H_p : f^{(j)}(0) = f^{(j)}(1),\ j = 0, 1, \dots, p - 1 \big\},$$

where $H_p := \{ f \in L^2[0, 1] : f^{(p-1)} \text{ absolutely continuous},\ f^{(p)} \in L^2[0, 1] \}$ is a Sobolev space
(cf. Neubauer (1988a, b)). However, if we consider the sequence of weights

$$\gamma_0 = 1 \quad\text{and}\quad \gamma_j = |j|^{2p}, \quad |j| > 0,$$

then the Sobolev space $W_p$ of periodic functions coincides with $\mathcal F_\gamma$. Therefore, let us
denote by $W_p^r := \mathcal F_\gamma^r$, $r > 0$, an ellipsoid in the Sobolev space $W_p$. In this illustration, we
shall consider the estimation of derivatives of the deconvolution density $f$. Therefore, it is
interesting to recall that, up to a constant, for any function $h \in W_p$ the weighted norm
$\|h\|_\omega$ with

$$\omega_0 = 1 \quad\text{and}\quad \omega_j = |j|^{2s}, \quad |j| > 0,$$

equals the $L^2$-norm of the $s$-th weak derivative for each integer $0 \le s \le p$. By virtue
of this relation, the results in the previous section imply also a lower as well as an upper
bound of the $L^2$-risk for the estimation of the $s$-th weak derivative of $f$. Finally, we restrict
our attention to error densities being either

[os] ordinary smooth, that is, the sequence $\lambda$ is polynomially decreasing, i.e., $\lambda_0 = 1$ and
$\lambda_j = |j|^{-2a}$, $|j| > 0$, for some $a > 1/2$, or

[ss] super smooth, that is, the sequence $\lambda$ is exponentially decreasing, i.e., $\lambda_0 = 1$ and
$\lambda_j = \exp(-|j|^{2a})$, $|j| > 0$, for some $a > 0$.

It is easily seen that the minimal regularity conditions given in Assumption 2.1 are satisfied.
Moreover, the additional conditions used in Theorems 2.2 and 2.3, i.e., $\Gamma = \sum_{j\in\mathbb Z} \gamma_j^{-1} < \infty$
and the existence of an error density bounded from below by $1/2$ as required in Theorem 2.3,
are satisfied in the super smooth case [ss]
if $p > 1/2$ and in the ordinary smooth case [os] if in addition $a > 1$. Roughly speaking, this
means that both the deconvolution density and the error density are at least continuous.
The lower bound presented in the next assertion now follows directly from Corollary 2.4.
Here and subsequently, we write $a_n \lesssim b_n$ when there exists $C > 0$ such that $a_n \le C b_n$ for
all sufficiently large $n \in \mathbb N$ and $a_n \sim b_n$ when $a_n \lesssim b_n$ and $b_n \lesssim a_n$ simultaneously.

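Proposition 2.6 Suppose independent iid. samples from $Y$ and $\varepsilon$ of size $n$ and $m$,
respectively. Then we have for any estimator $\tilde f^{(s)}$ of $f^{(s)}$

[os] in the ordinary smooth case, for all $p > 1/2$ and $a > 1$ that

$$\sup_{f\in W_p^r} \sup_{\varphi\in\mathcal E_\lambda^d} \big\{ \mathbb E\|\tilde f^{(s)} - f^{(s)}\|^2 \big\} \gtrsim n^{-2(p-s)/(2p+2a+1)} + m^{-((p-s)\wedge a)/a},$$

[ss] in the super smooth case, for all $p > 1/2$ that

$$\sup_{f\in W_p^r} \sup_{\varphi\in\mathcal E_\lambda^d} \big\{ \mathbb E\|\tilde f^{(s)} - f^{(s)}\|^2 \big\} \gtrsim (\log n)^{-(p-s)/a} + (\log m)^{-(p-s)/a}.$$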

As an estimator of $f^{(s)}$ we shall consider the $s$-th weak derivative of the estimator defined
in (1.2). Given the exponential basis $\{e_j\}_{j\in\mathbb Z}$, we recall that for each integer $0 \le s \le p$ the
$s$-th derivative in a weak sense of the estimator $\hat f_k$ is

f;r ^i2/-/r.A;, (/ . (2.3) 

Applying Theorem 2.5, the rates of the lower bound given in the last assertion provide, up
to a constant, also an upper bound of the $L^2$-risk of the estimator $\hat f_k^{(s)}$, which is summarized
in the next proposition. We have thus proved that these rates are optimal and the proposed
estimator $\hat f_k^{(s)}$ is minimax optimal in both cases. Furthermore, it is of interest to characterize
the minimal size $m$ of the additional sample from $\varepsilon$ needed to attain the same rate as in the case
of a known error density. Hence, we let the $\varepsilon$-sample size depend on the $Y$-sample size $n$, too.

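Proposition 2.7 Suppose independent iid. samples from $Y$ and $\varepsilon$ of size $n$ and $m$,
respectively. Consider the estimator $\hat f_k^{(s)}$ given in (2.3).

[os] In the ordinary smooth case, with dimension parameter $k \sim n^{1/(2p+2a+1)}$ we have

$$\sup_{f\in W_p^r} \sup_{\varphi\in\mathcal E_\lambda^d} \big\{ \mathbb E\|\hat f_k^{(s)} - f^{(s)}\|^2 \big\} \lesssim n^{-2(p-s)/(2p+2a+1)} + m^{-((p-s)\wedge a)/a}$$

and for any sequence $(m_n)_{n\ge1}$ it follows as $n \to \infty$ that

$$\sup_{f\in W_p^r} \sup_{\varphi\in\mathcal E_\lambda^d} \big\{ \mathbb E\|\hat f_k^{(s)} - f^{(s)}\|^2 \big\} = \begin{cases} O\big(n^{-2(p-s)/(2p+2a+1)}\big) & \text{if } n^{2((p-s)\vee a)/(2p+2a+1)} = O(m_n), \\ O\big(m_n^{-((p-s)\wedge a)/a}\big) & \text{otherwise.} \end{cases}$$

[ss] In the super smooth case, with dimension parameter $k \sim (\log n)^{1/(2a)}$ we have

$$\sup_{f\in W_p^r} \sup_{\varphi\in\mathcal E_\lambda^d} \big\{ \mathbb E\|\hat f_k^{(s)} - f^{(s)}\|^2 \big\} \lesssim (\log n)^{-(p-s)/a} + (\log m)^{-(p-s)/a}$$

and for any sequence $(m_n)_{n\ge1}$ it follows as $n \to \infty$ that

$$\sup_{f\in W_p^r} \sup_{\varphi\in\mathcal E_\lambda^d} \big\{ \mathbb E\|\hat f_k^{(s)} - f^{(s)}\|^2 \big\} = \begin{cases} O\big((\log n)^{-(p-s)/a}\big) & \text{if } \log n = O(\log m_n), \\ O\big((\log m_n)^{-(p-s)/a}\big) & \text{otherwise.} \end{cases}$$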

Note that in the ordinary smooth case we obtain the rate of known error density whenever
$n^{2((p-s)\vee a)/(2p+2a+1)} = O(m_n)$, which is much less than $n = m$. This is even more visible in
the super smooth case, where the rate of known error density is attained even if $m_n = n^r$ for
arbitrarily small $r > 0$. Moreover, we emphasize the influence of the parameter $a$, which
characterizes the rate of decay of the Fourier coefficients of the error density $\varphi$. Since a
smaller value of $a$ leads to faster rates of convergence, this parameter is often called the degree
of ill-posedness (cf. Natterer (1984)).
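The exponents appearing in the ordinary smooth case of Proposition 2.7 can be tabulated with exact rational arithmetic. The helper below is a hypothetical illustration, not part of the paper; it simply evaluates the three exponents for given $p$, $s$ and $a$.

```python
from fractions import Fraction

def os_rate_exponents(p, s, a):
    """Exponents in the ordinary smooth case of Proposition 2.7.

    rate_n:     the Y-sample term decays like n^(-rate_n)
    rate_m:     the error-sample term decays like m^(-rate_m)
    m_exponent: the known-error rate is kept if n^(m_exponent) = O(m_n)
    """
    p, s, a = Fraction(p), Fraction(s), Fraction(a)
    denom = 2 * p + 2 * a + 1
    rate_n = 2 * (p - s) / denom
    rate_m = min(p - s, a) / a
    m_exponent = 2 * max(p - s, a) / denom
    return rate_n, rate_m, m_exponent

# Example: p = 2, s = 0, a = 1 (hypothetical smoothness values).
example = os_rate_exponents(2, 0, 1)
```

For $p = 2$, $s = 0$, $a = 1$ this gives the exponents $4/7$, $1$ and $4/7$, so an error sample of size $m_n \sim n^{4/7}$ already suffices in this example. In general `m_exponent` is strictly smaller than one because $2\max(p - s, a) < 2p + 2a + 1$, and `rate_m * m_exponent` equals `rate_n`, which is exactly the balance between the two terms of the bound.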



3. A model selection approach: known degree of ill-posedness 

Our objective is to construct an adaptive estimator of the deconvolution density $f$.
Adaptation means that in spite of the unknown error density, the estimator should attain the
optimal rate of convergence over the ellipsoid $\mathcal F_\gamma^r$ for a wide range of different weight
sequences $\gamma$. However, in this section partial information about the error density $\varphi$ is supposed
to be available. To be precise, we assume that the sequence $\lambda$ and the value $d$ such that
$\varphi \in \mathcal E_\lambda^d$ are given in advance. Roughly speaking, this means that the degree of ill-posedness
of the underlying inverse problem is known. In what follows, the orthogonal series
estimator $\hat f_k$ defined in (1.2) is considered and a procedure to choose the dimension parameter $k$
based on a model selection approach via penalization is constructed. This procedure will
only involve the data and $\lambda$, $d$, and $\omega$. First, we introduce sequences of weights which are
used below.

Definition 3.1 

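(i) For all $k \ge 1$, define $\Delta_k := \max_{0\le|j|\le k} \omega_j/\lambda_j$, $\tau_k := \max_{0\le|j|\le k} (\omega_j)_{\vee1}/\lambda_j$ with $(q)_{\vee1} := \max(q, 1)$ and

$$\delta_k := k\,\Delta_k\, \frac{\log(\tau_k \vee (k + 2))}{\log(k + 2)}.$$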

Let further $\Sigma$ be a non-decreasing function such that for all $C > 0$



, fclogfa V(fc + 2)) ^ , . 
Tfc exp I — — — — — I S(C) < 00. (3.1) 



; i 3Clog(k + 2 
(ii) Define two sequences N and M as follows,



N n := N n (X) := max{l ^ N ^ re \ 8 N /n < Si}, 
M m :=M m (X,d) 



max < 1 ^ M ^ 



77? 



m exp (" w) < 



itiAaA ^ /504(i x 7 



A I 



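It is easy to see that there always exists a function $\Sigma$ satisfying condition (3.1). Consider the
orthogonal series estimator $\hat f_k$ defined in (1.2). The adaptive estimator $\hat f_{\hat k}$ is now obtained
by choosing the dimension parameter $\hat k$ such that

$$\hat k := \operatorname*{argmin}_{1\le k\le (N_n \wedge M_m)} \Big\{ -\|\hat f_k\|_\omega^2 + 60\, d\, \frac{\delta_k}{n} \Big\}. \qquad (3.2)$$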



Next, we derive an upper bound for the risk of this adaptive estimator. To this end, we 
need the following assumption. 

Assumption 3.2 The sequence $M$ satisfies $d^{-1} \min_{1\le|j|\le M_m} \lambda_j \ge 2/m$ for all $m \ge 1$.

By construction, this condition is always satisfied for sufficiently large m. 
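The selection rule (3.2) can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes that the penalty in (3.2) has the form $60\, d\, \delta_k / n$, that the upper bound `K` (playing the role of $N_n \wedge M_m$) has already been determined, and that the empirical coefficients for $j = 1, \dots, K$ are supplied as lists. All function and parameter names are hypothetical.

```python
import math

def delta_sequence(omega, lam, K):
    """delta_k = k * Delta_k * log(tau_k v (k+2)) / log(k+2) for k = 1..K."""
    deltas = []
    big_delta = omega(0) / lam(0)
    tau = max(omega(0), 1.0) / lam(0)
    for k in range(1, K + 1):
        # Running maxima over 0 <= |j| <= k (sequences are symmetric).
        big_delta = max(big_delta, omega(k) / lam(k))
        tau = max(tau, max(omega(k), 1.0) / lam(k))
        deltas.append(k * big_delta * math.log(max(tau, k + 2)) / math.log(k + 2))
    return deltas

def select_dimension(g_hat, phi_hat, omega, lam, n, m, d, K, const=60.0):
    """Minimise -||f_hat_k||_omega^2 + const * d * delta_k / n over k = 1..K.

    g_hat[j-1] and phi_hat[j-1] hold the empirical coefficients for j = 1..K.
    For real-valued data the coefficients at -j are complex conjugates, so
    each frequency pair contributes twice the squared modulus.
    """
    deltas = delta_sequence(omega, lam, K)
    norm_sq = 1.0
    best_k, best_val = None, float("inf")
    for k in range(1, K + 1):
        phi_sq = abs(phi_hat[k - 1]) ** 2
        if phi_sq >= 1.0 / m:  # same threshold as in the estimator (1.2)
            norm_sq += 2.0 * omega(k) * abs(g_hat[k - 1]) ** 2 / phi_sq
        val = -norm_sq + const * d * deltas[k - 1] / n
        if val < best_val:
            best_k, best_val = k, val
    return best_k
```

The contrast term rewards dimensions that capture more of the estimated signal, while the penalty grows with $\delta_k$, which reflects how strongly the noise is amplified when dividing by small Fourier coefficients of the error density.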

Theorem 3.3 Assume that we have independent iid. $Y$- and $\varepsilon$-samples of size $n$ and $m$,
respectively. Consider sequences $\omega$, $\gamma$, and $\lambda$ satisfying Assumption 2.1. Let $\delta$, $\Delta$, $N$, and
$M$ be as in Definition 3.1 and suppose that Assumption 3.2 holds. Consider the estimator $\hat f_{\hat k}$
defined in (1.2) with $\hat k$ given by (3.2). Then, there exists a numerical constant $C > 0$ such
that for all $n, m \ge 1$

sup sup 4 EJ[| fa- f\\l\ < C\ (d + r) min {max(u; fc /7fc, S k /n)} + dr K m 

f&Sveet 1 J I l<fc<(JV n AAf m ) 



ft (ci/Ai) 7 / 2 ft + S(rdA) 
m n 



where $\Lambda := \sum_{j\in\mathbb Z} \lambda_j$ and $\kappa_m$ is defined in Theorem 2.3.

Comparing the last assertion with the lower bound given in Corollary 2.4, we immediately 
obtain the following corollary. 

Corollary 3.4 Suppose in addition to the assumptions of Theorem 3.3 that the optimal
dimension parameter $k_n^*$ given in Theorem 2.2 is smaller than $N_n \wedge M_m$. If further
$\xi := \sup_{k\ge1} \big\{ \delta_k / \big( \sum_{0<|j|\le k} \omega_j/\lambda_j \big) \big\} < \infty$, then there is a numerical constant $C > 0$ such that



sup sup {m7t-f\\l} < c(e(d+r)V„+dr Km +d 



ft (d/Ai) 7 / 2 | ft + S(rdA) 
777, re 



where $\Lambda := \sum_{j\in\mathbb Z} \lambda_j$ and $\kappa_m$ is defined in Theorem 2.3.

Under the additional conditions the last assertion establishes the minimax-optimality of 
the partially adaptive estimator, since its upper risk-bound differs from the optimal one 
given in Corollary 2.4 only by a constant and negligible terms. However, these additional 
conditions are not necessary as shown below.

3.1. Illustration: estimation of derivatives (continued)

In Section 2.1, we described two different cases where we could choose the model $k$ such that
the resulting estimator reached the minimax optimal rate of convergence. The following
result shows that in the case of an unknown error density $\varphi \in \mathcal E_\lambda^d$ with a priori known $\lambda$ and $d$, the
adaptive estimator automatically attains the optimal rate over a wide range of values for
the smoothness parameters.

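Proposition 3.5 Assume that we have independent iid. $Y$- and $\varepsilon$-samples of size $n$ and $m_n$,
respectively. Consider the estimator $\hat f_{\hat k}^{(s)}$ given in (2.3) with $\hat k$ defined by (3.2).

[os] In the ordinary smooth case, we have

$$\Delta_k = k^{2a+2s}, \qquad \delta_k \sim k^{2a+2s+1}, \qquad N_n \sim n^{1/(2a+2s+1)}, \qquad M_{m_n} \sim \Big( \frac{m_n}{\log m_n} \Big)^{1/(2a)}.$$

In case $p - s > a$ we obtain

$$\sup_{f\in W_p^r} \sup_{\varphi\in\mathcal E_\lambda^d} \big\{ \mathbb E\|\hat f_{\hat k}^{(s)} - f^{(s)}\|^2 \big\} = \begin{cases} O\big(n^{-2(p-s)/(2p+2a+1)}\big) & \text{if } n^{2(p-s)/(2p+2a+1)} = O(m_n), \\ O\big(m_n^{-1}\big) & \text{otherwise,} \end{cases}$$

and in case $p - s \le a$, if $n^{2a/(2p+2a+1)} = O(m_n)$,

$$\sup_{f\in W_p^r} \sup_{\varphi\in\mathcal E_\lambda^d} \big\{ \mathbb E\|\hat f_{\hat k}^{(s)} - f^{(s)}\|^2 \big\} = \begin{cases} O\big(n^{-2(p-s)/(2p+2a+1)}\big) & \text{if } n^{2a/(2p+2a+1)} = O(m_n/\log m_n), \\ O\big(m_n^{-(p-s)/a} (\log m_n)^{(p-s)/a}\big) & \text{otherwise,} \end{cases}$$

while if $m_n = o\big(n^{2a/(2p+2a+1)}\big)$

$$\sup_{f\in W_p^r} \sup_{\varphi\in\mathcal E_\lambda^d} \big\{ \mathbb E\|\hat f_{\hat k}^{(s)} - f^{(s)}\|^2 \big\} = O\big(m_n^{-(p-s)/a} (\log m_n)^{(p-s)/a}\big).$$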

[ss] In the super smooth case, we have

$$\Delta_k = k^{2s} \exp(k^{2a}), \qquad \delta_k \sim k^{2a+2s+1} \exp(k^{2a}) (\log k)^{-1},$$

/ n log log n \ 1/(2a) f, rn n \ 1/(2a) 

and 



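$$\sup_{f\in W_p^r} \sup_{\varphi\in\mathcal E_\lambda^d} \big\{ \mathbb E\|\hat f_{\hat k}^{(s)} - f^{(s)}\|^2 \big\} = \begin{cases} O\big((\log n)^{-(p-s)/a}\big) & \text{if } \log n = O(\log m_n), \\ O\big((\log m_n)^{-(p-s)/a}\big) & \text{otherwise.} \end{cases}$$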



Compare this result with Proposition 2.7. In case [ss], the adaptive estimator
mimics exactly the behavior of the minimax optimal non-adaptive estimator, even though
$\delta_k / \big( \sum_{0<|j|\le k} \omega_j/\lambda_j \big) \sim k^{2a+1}/\log k$ is not bounded and hence the assumptions of
Corollary 3.4 are violated. In case [os], if additionally $p - s > a$, the adaptive estimator still
behaves like its minimax optimal non-adaptive counterpart. However, if $p - s \le a$, the
sequence $(m_n)_{n\ge1}$ must grow a little faster than in the non-adaptive case. Otherwise, the
convergence is slowed down by a logarithmic factor.

4. Unknown degree of ill-posedness



In this section, we dispense with any knowledge about the error density $\varphi$, that is, $\lambda$ and $d$
are not known anymore. We construct an adaptive estimator in this situation as well.
Recall that in the previous section, the dimension parameter $k$ was chosen using a criterion
function that involved the sequences $N$, $M$, and $\delta$, which depend on $\lambda$ and $d$. We circumvent
this problem by defining empirical versions of these three sequences at the beginning of this
section. The adaptive estimator is then defined analogously to the one from Section 3, but
uses the estimated rather than the original sequences.

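Definition 4.1 Let $\hat\delta:=(\hat\delta_k)_{k\ge1}$, $\hat N:=(\hat N_n)_{n\ge1}$, and $\hat M:=(\hat M_m)_{m\ge1}$ be as follows.

(i) Given $\hat\Delta_k:=\max_{0\le|j|\le k}\frac{\omega_j}{|[\hat\varphi]_j|^2}\,1\{|[\hat\varphi]_j|^2\ge1/m\}$ and $\hat\tau_k:=\max_{0\le|j|\le k}\frac{(\omega_j)_{\vee1}}{|[\hat\varphi]_j|^2}\,1\{|[\hat\varphi]_j|^2\ge1/m\}$, let

$$\hat\delta_k:=2k\,\hat\Delta_k\,\frac{\log(\hat\tau_k\vee(k+2))}{\log(k+2)}.$$

(ii) Given $N_n^u:=\operatorname{argmax}_{0<N\le n}\big\{\max_{0<j\le N}\omega_j/n\le1\big\}$, let

$$\hat N_n:=\operatorname{argmin}_{0<|j|\le N_n^u}\Big\{\frac{|[\hat g]_j|^2}{|j|(\omega_j)_{\vee1}}<\frac{\log n}{n}\Big\}\quad\text{and}\quad\hat M_m:=\operatorname{argmin}_{0<|j|\le m}\Big\{|[\hat\varphi]_j|^2<\frac{(\log m)^2}{m}\Big\}.$$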

It is worth stressing that none of these sequences involves any a priori knowledge about either the deconvolution density $f$ or the error density $\varphi$. Now, we choose $\hat k$ as

$$\hat k:=\operatorname{argmin}_{0<k\le(\hat N_n\wedge\hat M_m)}\Big\{-\|\hat f_k\|_\omega^2+600\,\frac{\hat\delta_k}{n}\Big\}.\tag{4.1}$$

Note that in contrast to the previous section, this choice does not depend on the sequences $\delta$, $N$, or $M$, but only on $\hat\delta$, $\hat N$, and $\hat M$, which can be computed from the observed data samples. This choice of the regularization parameter is hence fully data-driven. The constant 600 arising in the definition of $\hat k$, though convenient for deriving the theory, may be far too large in practice and may instead be determined by means of a simulation study as in Comte et al. (2006), for example.
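The selection rule (4.1) can be sketched numerically as follows. This is only an illustrative sketch under stated assumptions, not the implementation used for the theory: the empirical Fourier coefficients are taken as sample means of $\exp(-2\pi i jx)$ (the basis convention is an assumption), the weights are $\omega_j=|j|^{2s}$ with $\omega_0=1$, the "argmin" in Definition 4.1 (ii) is read as the first index satisfying the condition (falling back to the upper limit if none does), and the names `empirical_coeff`, `first_index`, `select_dimension` and the tunable constant `kappa` (which replaces 600) are hypothetical.

```python
import cmath
import math


def empirical_coeff(sample, freq):
    """Sample mean of exp(-2*pi*i*freq*x); the sign convention is an assumption."""
    return sum(cmath.exp(-2j * math.pi * freq * x) for x in sample) / len(sample)


def first_index(limit, is_small):
    """Smallest j in 1..limit with is_small(j); falls back to limit (assumption)."""
    for j in range(1, limit + 1):
        if is_small(j):
            return j
    return limit


def select_dimension(y_sample, eps_sample, s=0.0, kappa=600.0):
    """Sketch of the data-driven choice (4.1) with weights omega_j = |j|^(2s)."""
    n, m = len(y_sample), len(eps_sample)

    def omega(j):
        return float(j) ** (2 * s) if j > 0 else 1.0

    g_cache, phi_cache = {}, {}

    def g_sq(j):  # |[g_hat]_j|^2, computed lazily
        if j not in g_cache:
            g_cache[j] = abs(empirical_coeff(y_sample, j)) ** 2
        return g_cache[j]

    def phi_sq(j):  # |[phi_hat]_j|^2, computed lazily
        if j not in phi_cache:
            phi_cache[j] = abs(empirical_coeff(eps_sample, j)) ** 2
        return phi_cache[j]

    # N_n^u: largest N <= n such that omega_j <= n for all 0 < j <= N.
    n_up = 1
    for cand in range(1, n + 1):
        if omega(cand) > n:
            break
        n_up = cand

    n_hat = first_index(
        n_up, lambda j: g_sq(j) / (j * max(omega(j), 1.0)) < math.log(n) / n)
    m_hat = first_index(m, lambda j: phi_sq(j) < math.log(m) ** 2 / m)

    best_k, best_crit = 1, float("inf")
    norm_sq = 0.0                   # k-dependent part of ||f_hat_k||_omega^2
    delta_max, tau_max = 1.0, 1.0   # j = 0 terms: omega_0 = 1, [phi_hat]_0 = 1
    for k in range(1, min(n_hat, m_hat) + 1):
        if phi_sq(k) >= 1.0 / m:    # threshold indicator 1{|[phi_hat]_k|^2 >= 1/m}
            norm_sq += 2.0 * omega(k) * g_sq(k) / phi_sq(k)   # indices k and -k
            delta_max = max(delta_max, omega(k) / phi_sq(k))
            tau_max = max(tau_max, max(omega(k), 1.0) / phi_sq(k))
        delta_hat = 2.0 * k * delta_max * math.log(max(tau_max, k + 2)) / math.log(k + 2)
        crit = -norm_sq + kappa * delta_hat / n
        if crit < best_crit:
            best_k, best_crit = k, crit
    return best_k
```

In practice one would call `select_dimension(y, eps, s, kappa)` with a value of `kappa` calibrated by simulation, in the spirit of the remark on the constant 600 above.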

In order to show an upper risk bound, we need the following assumption.

Assumption 4.2

(i) The sequences $N$ and $M$ from Definition 3.1 (ii) satisfy the additional conditions

$$\max_{j>N_n}\frac{\lambda_j}{j(\omega_j)_{\vee1}}\le\frac{\log n}{4dn}\quad\text{and}\quad\max_{j>M_m}\lambda_j\le\frac{(\log m)^2}{4dm}.$$

(ii) For all $n\in\mathbb N$, $N_n^u$ given in Definition 4.1 (ii) fulfills $N_n\le N_n^u$.

By construction, these conditions are always satisfied for sufficiently large $n$ and $m$. We are now able to state the main result of this paper, providing an upper risk bound for the fully adaptive estimator.
Theorem 4.3 Assume that we have independent i.i.d. $Y$- and $\varepsilon$-samples of size $n$ and $m$, respectively. Consider sequences $\omega$, $\gamma$, and $\lambda$ satisfying Assumption 2.1. Let the sequences $\delta$, $N$, and $M$ be as in Definition 3.1 and suppose that Assumptions 3.2 and 4.2 hold. Define further $N_n^l:=\operatorname{argmax}_j\big\{\frac{\lambda_j}{j(\omega_j)_{\vee1}}\ge\frac{4d\log n}{n}\big\}$ and $M_m^l:=\operatorname{argmax}_j\big\{\lambda_j\ge\frac{4d(\log m)^2}{m}\big\}$. Consider the estimator $\hat f_{\hat k}$ defined in (1.2) with $\hat k$ given by (4.1). Then there exists a numerical constant $C$ such that for all $n,m\ge1$



sup sup \ E||A - f\\l \ ^ C\ (r + d( d ) min {max(aj k /7k, o~k/n)} + dr k Ti 



+ d 



(fr+r)(d/Ai) 7 | fr + S(rrfACd) 



m n 



where $\Lambda:=\sum_{j\in\mathbb Z}\lambda_j$, $\zeta_d:=\log 3d/\log 3$, and $\kappa_m$ is defined in Theorem 2.3.

Comparing the last assertion with Theorem 3.3, we see that, surprisingly, the estimation of the sequences $\delta$, $N$, and $M$ essentially changes the upper bound only by replacing $N$ and $M$ by $N^l$ and $M^l$, respectively. Therefore, in analogy to the results in Section 3, we have the following corollary.

Corollary 4.4 Suppose that in addition to the assumptions of Theorem 4.3 we have that the optimal dimension parameter $k_n^*$ given in Theorem 2.2 is smaller than $N_n^l\wedge M_m^l$. If further $\xi:=\sup_{k\ge1}\big\{\delta_k/\big(\sum_{0<|j|\le k}\omega_j/\lambda_j\big)\big\}<\infty$, then there is a numerical constant $C>0$ such that

- / w r < C\i{dCd+r)i> n +dr n m +dQ d 1 

K J L m n 



sup sup 

where $\Lambda:=\sum_{j\in\mathbb Z}\lambda_j$, $\zeta_d:=\log 3d/\log 3$, and $\kappa_m$ is defined in Theorem 2.3.

Under the additional conditions, the last assertion establishes the minimax optimality of the fully adaptive estimator, since its upper risk bound differs from the optimal one given in Corollary 2.4 only by a constant and negligible terms. However, these additional conditions are not necessary, as shown below.

4.1. Illustration: estimation of derivatives (continued) 

The following result shows that even without any prior knowledge on the error density $\varphi$, the fully adaptive penalized estimator automatically attains the optimal rate in the super smooth case and in the ordinary smooth case as far as $p-s>a$. Recall that the computation of the dimension parameter $\hat k$ given in (4.1) involves the sequence $(N_n^u)_{n\ge1}$, which in our illustration satisfies $N_n^u\sim n^{1/(2s)}$ since $\omega_j=|j|^{2s}$, $j\ge1$.

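Proposition 4.5 Assume that we have independent i.i.d. $Y$- and $\varepsilon$-samples of size $n$ and $m$, respectively. Consider the estimator $\hat f^{(s)}_{\hat k}$ given in (2.3) with $\hat k$ defined by (4.1).

[os] In the ordinary smooth case with $p-s>a$ we obtain

$$\sup_{f\in\mathcal W_p^r}\ \sup_{\varphi\in\mathcal E_a^d}\big\{\mathbb E\|\hat f^{(s)}_{\hat k}-f^{(s)}\|^2\big\}=\begin{cases}O\big(n^{-2(p-s)/(2p+2a+1)}\big)&\text{if } n^{2(p-s)/(2p+2a+1)}=O(m_n),\\ O\big(m_n^{-1}\big)&\text{otherwise,}\end{cases}$$

and with $p-s\le a$, if $n^{2a/(2p+2a+1)}=O(m_n)$,

$$\sup_{f\in\mathcal W_p^r}\ \sup_{\varphi\in\mathcal E_a^d}\big\{\mathbb E\|\hat f^{(s)}_{\hat k}-f^{(s)}\|^2\big\}=\begin{cases}O\big(n^{-2(p-s)/(2p+2a+1)}\big)&\text{if } n^{2a/(2p+2a+1)}=O\big(m_n/(\log m_n)^2\big),\\ O\big(m_n^{-(p-s)/a}(\log m_n)^{2(p-s)/a}\big)&\text{otherwise,}\end{cases}$$

while if $m_n=o\big(n^{2a/(2p+2a+1)}\big)$,

$$\sup_{f\in\mathcal W_p^r}\ \sup_{\varphi\in\mathcal E_a^d}\big\{\mathbb E\|\hat f^{(s)}_{\hat k}-f^{(s)}\|^2\big\}=O\big(m_n^{-(p-s)/a}(\log m_n)^{2(p-s)/a}\big).$$

[ss] In the super smooth case, we have

$$\sup_{f\in\mathcal W_p^r}\ \sup_{\varphi\in\mathcal E_a^d}\big\{\mathbb E\|\hat f^{(s)}_{\hat k}-f^{(s)}\|^2\big\}=\begin{cases}O\big((\log n)^{-(p-s)/a}\big)&\text{if } \log n=O(\log m_n),\\ O\big((\log m_n)^{-(p-s)/a}\big)&\text{otherwise.}\end{cases}$$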

Notice that the last result differs from Proposition 3.5 solely in case [os] with $p-s\le a$, where $(\log m_n)$ is replaced by $(\log m_n)^2$. Hence, in all other cases the fully adaptive estimator attains the minimax optimal rate. In particular, it is not necessary to know in advance whether the error density is ordinary or super smooth. Moreover, as long as $m_n\sim n$, the fully adaptive estimator always attains the same optimal rate as in the case of known error density. However, over a wide range of values for the smoothness parameters, the minimax optimal rate is still obtained even when $m_n$ grows slower than $n$.
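To make the case distinction in the ordinary smooth part of Proposition 4.5 concrete, the following sketch evaluates the rate for polynomially growing error samples. It assumes $m_n=n^\beta$ (an assumption made only for this illustration) and returns the exponents of $n$ and of $\log n$ in the resulting bound, up to constants; the function name `os_rate_exponents` is hypothetical.

```python
def os_rate_exponents(p, s, a, beta):
    """Exponents (power of n, power of log n) of the [os] bound for m_n = n**beta.

    With m_n = n**beta, a condition n**c = O(m_n) reads c <= beta, while
    n**c = O(m_n / (log m_n)**2) requires c < beta because of the log factor.
    """
    denom = 2 * p + 2 * a + 1
    if p - s > a:
        if beta >= 2 * (p - s) / denom:
            return (-2 * (p - s) / denom, 0.0)   # rate n^{-2(p-s)/(2p+2a+1)}
        return (-beta, 0.0)                      # rate m_n^{-1}
    if beta > 2 * a / denom:
        return (-2 * (p - s) / denom, 0.0)       # rate n^{-2(p-s)/(2p+2a+1)}
    # rate m_n^{-(p-s)/a} (log m_n)^{2(p-s)/a}
    return (-beta * (p - s) / a, 2 * (p - s) / a)
```

For instance, with $p=1$, $s=0$, $a=2$ and $\beta=1/2$ the error sample is too small, and the bound becomes $n^{-1/4}(\log n)$ up to constants, whereas $\beta=1$ gives the rate $n^{-2/7}$.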

A. Proofs 

A.1. Proofs of section 2
Lower bounds 

Proof of Theorem 2.2. Given $\zeta:=\eta\min(r-1,1/(8d\Gamma))$ and $\alpha_n:=\psi_n\big(\sum_{0<|j|\le k_n^*}\omega_j/(\lambda_jn)\big)^{-1}$ we consider the function $f:=1+(\zeta\alpha_n/n)^{1/2}\sum_{0<|j|\le k_n^*}\lambda_j^{-1/2}e_j$. We will show that for any $\theta:=(\theta_j)\in\{-1,1\}^{2k_n^*}$, the function $f_\theta:=1+\sum_{0<|j|\le k_n^*}\theta_j[f]_je_j$ belongs to $\mathcal F_\gamma^r$ and is hence a possible candidate of the deconvolution density. For each $\theta$, the $Y$-density corresponding to the $X$-density $f_\theta$ is given by $g_\theta:=f_\theta*\varphi$. We denote by $g_\theta^n$ the joint density of an i.i.d. $n$-sample from $g_\theta$ and by $\mathbb E_\theta$ the expectation with respect to the joint density $g_\theta^n$. Furthermore, for $0<|j|\le k_n^*$ and each $\theta$ we introduce $\theta^{(j)}$ by $\theta^{(j)}_l=\theta_l$ for $l\ne j$ and $\theta^{(j)}_j=-\theta_j$.

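The key argument of this proof is the following reduction scheme. If $\tilde f$ denotes an estimator of $f$ then we conclude

$$\sup_{f\in\mathcal F_\gamma^r}\mathbb E\|\tilde f-f\|_\omega^2\ \ge\sup_{\theta\in\{-1,1\}^{2k_n^*}}\mathbb E_\theta\|\tilde f-f_\theta\|_\omega^2\ \ge\ \frac1{2^{2k_n^*}}\sum_{\theta\in\{-1,1\}^{2k_n^*}}\mathbb E_\theta\|\tilde f-f_\theta\|_\omega^2$$

$$\ge\ \frac1{2^{2k_n^*}}\sum_{\theta\in\{-1,1\}^{2k_n^*}}\ \sum_{0<|j|\le k_n^*}\omega_j\,\mathbb E_\theta|[\tilde f-f_\theta]_j|^2$$

$$=\ \frac1{2^{2k_n^*}}\sum_{\theta\in\{-1,1\}^{2k_n^*}}\ \sum_{0<|j|\le k_n^*}\frac{\omega_j}2\Big\{\mathbb E_\theta|[\tilde f-f_\theta]_j|^2+\mathbb E_{\theta^{(j)}}|[\tilde f-f_{\theta^{(j)}}]_j|^2\Big\}.$$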
Below we show furthermore that for all $n\ge2$ we have



{Mo\[f-fe]j\ 2 +%,•)![/- fea)]j\ 2 } 



(A.l) 



Combining the last lower bound and the reduction scheme gives 



sup n\f-ft> 



1 



2 2fc n 



£ £ 

0e{-i,i}«£ o<|iKfc* 



c = c 

2 4A,n 8 



a, 



£ 



A 9 n 



Hence, employing the definition of $\zeta$ and $\alpha_n$ we obtain the lower bound given in the theorem. To conclude the proof, it remains to check (A.1) and $f_\theta\in\mathcal F_\gamma^r$ for all $\theta\in\{-1,1\}^{2k_n^*}$. The latter is easily verified if $f\in\mathcal F_\gamma^r$. In order to show that $f\in\mathcal F_\gamma^r$, we first notice that $f$ integrates to one. Moreover, $f$ is non-negative because $\big|\sum_{0<|j|\le k_n^*}[f]_je_j\big|\le1$, and $\|f\|_\gamma^2\le r$, which can be realized as follows. By employing the condition $\sum_{j\in\mathbb Z}\gamma_j^{-1}=\Gamma<\infty$ we have



E i/^-k E iWii 

o<|j|<fc£ o<|i|<fc* 



■n 



E *. 



-1/2 



1/2 



E 



7 7 - 



1/2 



E 



nA, 



1/2 



1/2 



E 



0<|j|<fc* 0<|j|<fe* J 0<|j|<fc* 

Since $\omega/\gamma$ is non-increasing, the definition of $\zeta$, $\alpha_n$ and $\eta$ implies



1/2 



E [/li e iK(cr 

0<[j[<fc* 



V2/7fe 



-ftr 



E 



0<[i[<fc* 



A 9 n 



/Cr\i/2 

< — I ^ 1 
V 7/ 



(A.2) 



as well as ||/||* O + 



Q<|i|<fc* jiA, 



< 1 + C/?? < r. 



It remains to show (A.1). Consider the Hellinger affinity $\rho(g_\theta^n,g_{\theta^{(j)}}^n)=\int\sqrt{g_\theta^n}\sqrt{g_{\theta^{(j)}}^n}$, then we obtain for any estimator $\tilde f$ of $f$ that



Cj). 



|[/ - 



\[fe-f e u)]j\V 9m 

\[f-feiM\n ^ 1 / 2 
12 



+ 



[f-fe 



life - feu)]j\ 

|[/-HI 2 
life ~ feu)]j\ 



g e y %j) 
1/2 



[/e _ fe<J)\j\ 
Rewriting the last estimate we obtain 

[E e \[f- fehl 2 + %.)|[7- /^)],f} > - /jwlil 2 ^,^))- 



(A.3) 



Next we bound from below the Hellinger affinity $\rho(g_\theta^n,g_{\theta^{(j)}}^n)$. Therefore, we consider first the Hellinger distance



H 2 (ge,g e u 



\f9~o ~ \ZdeW) 

ge - 9eU) 
V9 e + y/9 e m 



^ 4||5e 



i 2 = m\n^i 2 ^, 
where we have used that $\alpha_n\le1/\eta$, $\varphi\in\mathcal E_\lambda^d$ and $g_\theta\ge1/2$ because $\big|\sum_{0<|j|\le k_n^*}[g_\theta]_je_j\big|\le1/2$, which can be realized as follows. By using the condition $\sum_{j\in\mathbb Z}\gamma_j^{-1}=\Gamma<\infty$ and $\varphi\in\mathcal E_\lambda^d$ we obtain in analogy to the proof of (A.2) that

E bw< E E \- 1/2 <(f)' /2 «v2. 

0<|i|<fc* 0<|i|<fc* 0<|j|<fc* 

Therefore, the definition of $\zeta$ implies $H^2(g_\theta,g_{\theta^{(j)}})\le2/n$. By using the independence, i.e., $\rho(g_\theta^n,g_{\theta^{(j)}}^n)=\rho(g_\theta,g_{\theta^{(j)}})^n$, together with the identity $\rho(g_\theta,g_{\theta^{(j)}})=1-\frac12H^2(g_\theta,g_{\theta^{(j)}})$ it follows that $\rho(g_\theta^n,g_{\theta^{(j)}}^n)\ge(1-n^{-1})^n\ge1/4$ for all $n\ge2$. By combination of the last estimate with (A.3) we obtain (A.1), which completes the proof. □
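The elementary bound $(1-n^{-1})^n\ge1/4$ for $n\ge2$ used in the last step can be checked numerically. The sketch below is only a sanity check, not part of the proof; the function name `affinity_lower_bound` is hypothetical. The sequence equals $1/4$ at $n=2$ and increases towards $e^{-1}$.

```python
import math


def affinity_lower_bound(n):
    """Value of (1 - 1/n)**n, the lower bound on the n-fold Hellinger affinity."""
    return (1.0 - 1.0 / n) ** n


# Evaluate the bound on a range of sample sizes n >= 2.
values = [affinity_lower_bound(n) for n in range(2, 2001)]
smallest = min(values)                  # attained at n = 2
limit = math.exp(-1.0)                  # the sequence increases towards 1/e
```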

Proof of Theorem 2.3. We construct for each $\theta\in\{-1,1\}$ an error density $\varphi_\theta\in\mathcal E_\lambda^d$ and a deconvolution density $f_\theta\in\mathcal F_\gamma^r$, such that $g_\theta:=f_\theta*\varphi_\theta$ satisfies $g_1=g_{-1}$. To be more precise, define $k_m^*:=\operatorname{argmax}_{|j|>0}\{\omega_j\gamma_j^{-1}\min(1,m^{-1}\lambda_j^{-1})\}$ and $\alpha_m:=\zeta\min(1,m^{-1/2}\lambda_{k_m^*}^{-1/2})$ with $\zeta:=\min(1/(2\sqrt d),(1-d^{-1/4}))$. Observe that $1\ge(1-\alpha_m)^2\ge(1-(1-1/d^{1/4}))^2\ge1/d^{1/2}$ and $1\le(1+\alpha_m)^2\le(1+(1-1/d^{1/4}))^2=(2-1/d^{1/4})^2\le d^{1/2}$, which implies $1/d^{1/2}\le(1+\theta\alpha_m)^2\le d^{1/2}$. These inequalities will be used below without further reference. By assumption there is a density $\varphi\in\mathcal E_\lambda^{\sqrt d}$ such that $\varphi\ge1/2$. We show below that
for each 9 the function fg := 1 + (1 — 9a m ) mm ^ 1 r / ^ 1 ' 1 ^ r ) h }^ 2 &k* belongs to J 7 !,' and the 
function ipg := (p + 9ot m [(p]k* e^* is an element of Moreover, it is easily verified that 

99 = 1 + (1 - a m) mm( ]v4^ ^y 2 M k* m £k* m and hence 51 = We denote by g# the joint 
density of an i.i.d. $n$-sample from $g_\theta$ and by $\varphi_\theta^m$ the joint density of an i.i.d. $m$-sample from $\varphi_\theta$. Since the samples are independent from each other, $p_\theta:=g_\theta^n\varphi_\theta^m$ is the joint density of all observations and we denote by $\mathbb E_\theta$ the expectation with respect to $p_\theta$. Applying a reduction scheme we deduce that for each estimator $\tilde f$ of $f$

sup su P E||/-/|| 2 ^ max E e ||/-/,|| 2 ^ UexIIJ-ZxII 2 +E_ 1 ||J-/_ 1 || 2 ). 
f^Wesi ee{-i,i} 2 1 J 

Below we show furthermore that for all m ^ 2 we have 

ExH/ - Ml 2 + E_i||/ - /-i|| 2 > \\h ~ /-ill'- (A.4) 

Moreover, we have \\h - /-i|| 2 = 4a^ 7 ^^ = 4^ C 2 ^7^ min(l, 
Combining the last lower bound, the reduction scheme and the definition of k* m implies the 
result of the theorem. 
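The elementary inequalities $1/d^{1/2}\le(1+\theta\alpha_m)^2\le d^{1/2}$ used in the construction above only rely on $0\le\alpha_m\le\zeta=\min(1/(2\sqrt d),1-d^{-1/4})$. The following sketch checks them on a grid; it is a numerical sanity check under that reading of the definitions, and the names `zeta` and `perturbation_stays_in_class` are hypothetical.

```python
import math


def zeta(d):
    """The constant zeta = min(1/(2*sqrt(d)), 1 - d**(-1/4)) from the proof."""
    return min(1.0 / (2.0 * math.sqrt(d)), 1.0 - d ** -0.25)


def perturbation_stays_in_class(d, steps=50, tol=1e-12):
    """Check d**-0.5 <= (1 + theta*alpha)**2 <= d**0.5 for 0 <= alpha <= zeta(d)."""
    lower, upper = d ** -0.5, d ** 0.5
    for i in range(steps + 1):
        alpha = zeta(d) * i / steps
        for theta in (-1, 1):
            value = (1.0 + theta * alpha) ** 2
            # tol absorbs floating-point rounding at the boundary cases
            if not (lower - tol <= value <= upper + tol):
                return False
    return True
```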

To conclude the proof, it remains to check (A.4), fg G J 7 ^ and ipg G £^ for both 9. In order 
to show fg G J-y, we first observe that fe integrates to one. Moreover, fg is non- negative 

because |(1 - 9 am ) 1 -^^ lk } /2 \ < lk }J 2 < 1 and ||/ e || 2 = 1 + 7fcjJ[/«]fcsJ 2 < 1 + ^K 1 " 
fla m ) 1A X / / , 4 ~ 1 7 J u* 1 ^ 2 | 2 ^ r. Consider which obviously integrates to one. Furthermore, 
as ip ^ 1/2 the function <p e = ^ + feml^^ej.; is non-negative since \9a m [ip]k* m e k * m \ < 
a m A^ (f 1 / 2 < C^ 1/2 Vd < 1/2 by using the definition of a m and £. To check that ipg £ 

m 

it remains to show that $1/d\le|[\varphi_\theta]_j|^2/\lambda_j\le d$ for all $|j|>0$. Since $\varphi\in\mathcal E_\lambda^{\sqrt d}$, it follows from the definition of $\varphi_\theta$ that these inequalities are satisfied for all $j\ne k_m^*$ and moreover that $1/d\le(1+\theta\alpha_m)^2/d^{1/2}\le|[\varphi_\theta]_{k_m^*}|^2/\lambda_{k_m^*}\le d^{1/2}(1+\theta\alpha_m)^2\le d$. Finally consider (A.4). As in the proof of Theorem 2.2, by employing the Hellinger affinity $\rho(p_1,p_{-1})$ we obtain for any estimator $\tilde f$ of $f$ that



{Ei||/ - fi\\l + - /ill" } > ^ll/i - /-i|£p(pi,P-i). 



Next we bound from below the Hellinger affinity, $\rho(p_1,p_{-1})\ge1/4$ for all $m\ge2$, which proves (A.4). From the independence and the fact that $g_1=g_{-1}$, it is easily seen that the Hellinger affinity satisfies

$$\rho(p_1,p_{-1})=\rho(g_1,g_{-1})^n\,\rho(\varphi_1,\varphi_{-1})^m=\rho(\varphi_1,\varphi_{-1})^m=\big(1-\tfrac12H^2(\varphi_1,\varphi_{-1})\big)^m.$$

Hence, we conclude $\rho(p_1,p_{-1})\ge(1-1/m)^m\ge1/4$, for all $m\ge2$, since

$$H^2(\varphi_1,\varphi_{-1})\le2\int|\varphi_1-\varphi_{-1}|^2=2\cdot4\alpha_m^2|[\varphi]_{k_m^*}|^2\le8d\,\alpha_m^2\lambda_{k_m^*}\le8d\,\zeta^2m^{-1}\le2m^{-1},$$

where we used that $\varphi\ge1/2$ and the definition of $\alpha_m$ and $\zeta$. This completes the proof. □



Upper bound 

Proof of Theorem 2.5. We begin our proof with the observation that $\operatorname{Var}([\hat g]_j)\le1/n$ and $\operatorname{Var}([\hat\varphi]_j)\le1/m$ for all $j\in\mathbb Z$. Moreover, by applying Theorem 2.10 in Petrov (1995) there exists a constant $C>0$ such that $\mathbb E|[\hat\varphi]_j-[\varphi]_j|^4\le C/m^2$ for all $j\in\mathbb Z$ and $m\in\mathbb N$. These results are used below without further reference. Define now $\tilde f:=1+\sum_{0<|j|\le k_n^*}[f]_j\,1\{|[\hat\varphi]_j|^2\ge1/m\}\,e_j$ and decompose the risk into two terms,

$$\mathbb E\|\hat f-f\|_\omega^2\le2\,\mathbb E\|\hat f-\tilde f\|_\omega^2+2\,\mathbb E\|\tilde f-f\|_\omega^2=:A+B,\tag{A.5}$$

which we bound separately. Consider first $A$, which we decompose further,

'|[?]i-bL- 12 



(A.5) 



E||/-/||^2 

0<\j\<k* 



■ |2 



■mm^^i/m} 



+2 Yl w ii[/]ii 2E 

0<\j\<k* 



\W\j - Mil 



• |2 



■l{|[^| 2 ^l/m} 



=: Ai + A 2 . 



By using the elementary inequality 1/2 ^ I — 1| 2 + |[?]j/Mj| 2 > the independence 
of <p and g, and <p € together with the definition of ip n given in (2.1), we obtain 



A x < 4 ^ a 

o<]j|<fc* 

Moreover, we have E 

2(C+l)d 
mA,- 



m"Var([^]j) Var([^] 



, i ¥ar([ff]j 

I i r i i o 



-}^8d Y 



0<\j\<k* 



l[y]j-Mjl 2 Tl rir,C-l .|2 ^ -i ^ 2mE|[y] J -M J -| 4 _,_ 2Var([y] J ) < 2(C+1) 



and J jl " l{|[y] 3 -| 2 ^ 1/m} < 1, where we have used again the elementary 



W" 



■t{\W > iM < 



+ 
inequality and $\varphi\in\mathcal E_\lambda^d$. By combination of both bounds together with $f\in\mathcal F_\gamma^r$ and the definition of $\kappa_m$ given in (2.2) we obtain

$$A_2\le4(C+1)d\sum_{0<|j|\le k_n^*}\omega_j|[f]_j|^2\min\Big(1,\frac1{m\lambda_j}\Big)\le4(C+1)\,d\,r\,\kappa_m.$$

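Consider now $B$, which we decompose further into

$$\mathbb E\|\tilde f-f\|_\omega^2=\sum_{|j|>0}\omega_j|[f]_j|^2\,\mathbb E\big(1-1\{0<|j|\le k_n^*\}1\{|[\hat\varphi]_j|^2\ge1/m\}\big)^2$$

$$=\sum_{|j|>k_n^*}\omega_j|[f]_j|^2+\sum_{0<|j|\le k_n^*}\omega_j|[f]_j|^2\,\mathbb P\big(|[\hat\varphi]_j|^2<1/m\big)=:B_1+B_2,$$

where $B_1\le\|f\|_\gamma^2\,\omega_{k_n^*}\gamma_{k_n^*}^{-1}\le r\,\psi_n$ because $f\in\mathcal F_\gamma^r$. Moreover, $B_2\le4d\,r\,\kappa_m$ by using that

$$\mathbb P\big(|[\hat\varphi]_j|^2<1/m\big)\le4d\min\Big(1,\frac1{m\lambda_j}\Big),\tag{A.6}$$

which we will show below. The result of the theorem follows now by combination of the decomposition (A.5) and the estimates of $A_1$, $A_2$, $B_1$ and $B_2$.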

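To conclude, let us prove (A.6). If $|[\varphi]_j|^2\ge4/m$, then we deduce by employing Chebyshev's inequality that

$$\mathbb P\big(|[\hat\varphi]_j|^2<1/m\big)\le\mathbb P\big(|[\hat\varphi]_j/[\varphi]_j|<1/2\big)\le\mathbb P\big(|[\hat\varphi]_j-[\varphi]_j|>|[\varphi]_j|/2\big)\le4\,\frac{\operatorname{Var}([\hat\varphi]_j)}{|[\varphi]_j|^2}\le4d/(m\lambda_j).$$

On the other hand, in case $|[\varphi]_j|^2<4/m$ the estimate $\mathbb P(|[\hat\varphi]_j|^2<1/m)\le4d/(m\lambda_j)$ holds too since $1\le4/(m|[\varphi]_j|^2)\le4d/(m\lambda_j)$. Combining the last estimates and $\mathbb P(|[\hat\varphi]_j|^2<1/m)\le1$ we obtain (A.6), which completes the proof. □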
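The bound (A.6) can be illustrated by a small Monte Carlo experiment. The sketch below is not part of the proof and rests on assumptions made only for the illustration: the error is taken uniform on $[0,1/2)$, for which $|[\varphi]_j|^2=(2/(\pi j))^2$ at odd frequencies $j$, the empirical coefficient is the sample mean of $\exp(-2\pi ij\varepsilon_i)$, and the bound is checked in the form $\min(1,4/(m|[\varphi]_j|^2))$, which is at most $4d\min(1,1/(m\lambda_j))$ for $\varphi\in\mathcal E_\lambda^d$. The function names are hypothetical.

```python
import cmath
import math
import random


def small_coeff_probability(freq, m, reps=500, seed=0):
    """Monte Carlo estimate of P(|[phi_hat]_freq|^2 < 1/m) for uniform[0, 1/2) errors."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        coeff = sum(cmath.exp(-2j * math.pi * freq * rng.uniform(0.0, 0.5))
                    for _ in range(m)) / m
        if abs(coeff) ** 2 < 1.0 / m:
            hits += 1
    return hits / reps


def chebyshev_bound(freq, m):
    """min(1, 4 / (m |[phi]_freq|^2)) with |[phi]_freq|^2 = (2/(pi*freq))^2, odd freq."""
    phi_sq = (2.0 / (math.pi * freq)) ** 2
    return min(1.0, 4.0 / (m * phi_sq))
```

The estimated probability is typically far below the Chebyshev-type bound, which reflects that the bound is crude but sufficient for the rate.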



Illustration: estimation of derivatives 

Proof of Proposition 2.6. Since for each $0\le s\le p$ we have $\mathbb E\|\hat f^{(s)}-f^{(s)}\|^2\sim\mathbb E\|\hat f-f\|_\omega^2$, we intend to apply the general result given in Corollary 2.4. In both cases the additional conditions formulated in Theorem 2.2 and 2.3 are easily verified. Therefore, it is sufficient to evaluate the lower bounds $\psi_n$ and $\kappa_m$ given in (2.1) and (2.2), respectively. Note that the optimal dimension parameter $k_n^*:=\operatorname{argmin}_{j\in\mathbb N}\big\{\max\big(\frac{\omega_j}{\gamma_j},\sum_{0<|l|\le j}\frac{\omega_l}{n\lambda_l}\big)\big\}$ satisfies $n\,\omega_{k_n^*}/\gamma_{k_n^*}\sim\sum_{0<|l|\le k_n^*}\omega_l/\lambda_l$ since both sequences $(\gamma_j/\omega_j)$ and $\big(\sum_{0<|l|\le j}\omega_l/\lambda_l\big)$ are non-decreasing.

[os] The well-known approximation $\sum_{j=1}^mj^r\sim m^{r+1}$ for $r>0$ implies $(\gamma_{k_n^*}/\omega_{k_n^*})\sum_{0<|l|\le k_n^*}\omega_l/\lambda_l\sim(k_n^*)^{2a+2p+1}$. It follows that $k_n^*\sim n^{1/(2p+2a+1)}$ and the first lower bound writes $\psi_n\sim n^{-(2p-2s)/(2p+2a+1)}$. Moreover, we have $\kappa_m\sim m^{-((p-s)\wedge a)/a}$, since the minimum in $\kappa_m=\sup_{j\in\mathbb Z}\{|j|^{-2(p-s)}\min(1,|j|^{2a}/m)\}$ is equal to one for $|j|\ge m^{1/(2a)}$ and $|j|^{-2(p-s)}$ is non-increasing.
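The orders $k_n^*\sim n^{1/(2p+2a+1)}$ and $\psi_n\sim n^{-2(p-s)/(2p+2a+1)}$ derived above can be compared with a brute-force evaluation of the defining minimization. The following sketch assumes the ordinary smooth sequences $\gamma_j=|j|^{2p}$, $\omega_j=|j|^{2s}$, $\lambda_j=|j|^{-2a}$ without constants, and searches only up to a hypothetical cut-off `k_max`; the function name `oracle_dimension` is likewise hypothetical.

```python
def oracle_dimension(n, p, s, a, k_max=200):
    """Brute-force minimizer of max(omega_k/gamma_k, sum_{0<|j|<=k} omega_j/(n lambda_j)).

    Uses gamma_j = |j|^(2p), omega_j = |j|^(2s), lambda_j = |j|^(-2a).
    Returns the minimizing k and the minimal value (a proxy for psi_n).
    """
    best_k, best_val = 1, float("inf")
    variance = 0.0
    for k in range(1, k_max + 1):
        variance += 2.0 * k ** (2 * s + 2 * a) / n   # indices k and -k
        bias = float(k) ** (2 * s - 2 * p)           # omega_k / gamma_k
        val = max(bias, variance)
        if val < best_val:
            best_k, best_val = k, val
    return best_k, best_val
```

For example, with $p=2$, $s=0$, $a=1$ the ratios `k / n**(1/7)` and `val / n**(-4/7)` stay within constant factors as $n$ varies, in line with the stated orders.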

[ss] Applying Laplace's Method (cf. chapter 3.7 in Olver (1974)) we have $(\gamma_{k_n^*}/\omega_{k_n^*})\sum_{0<|l|\le k_n^*}\omega_l/\lambda_l\sim(k_n^*)^{2p}\exp(|k_n^*|^{2a})$, which implies that $k_n^*\sim(\log n)^{1/(2a)}$ and that the first lower bound can be rewritten as $\psi_n\sim(\log n)^{-(p-s)/a}$. Furthermore, we have $\kappa_m\sim(\log m)^{-(p-s)/a}$ since the minimum in $\kappa_m=\sup_{j\in\mathbb Z}\{|j|^{-2(p-s)}\min(1,\exp(|j|^{2a})/m)\}$ is equal to one for $|j|\ge(\log m)^{1/(2a)}$ and $|j|^{-2(p-s)}$ is non-increasing. Consequently, the lower bounds in Proposition 2.7 follow by applying Corollary 2.4. □

Proof of Proposition 2.7. Since in both cases the condition on the dimension parameter $k$ ensures that $k\sim k_n^*$ (see the proof of Proposition 2.6), the result follows from Theorem 2.5. □



A.2. Proofs of section 3

We begin by defining and recalling notations to be used in the proof. Given $u\in L^2[0,1]$ we denote by $[u]$ the infinite vector of Fourier coefficients $[u]_j:=\langle u,e_j\rangle$. In particular we use the notations



/*= E > 1/™}*, Ik := E ^ f k := £ ^ 



$ n := E &l{IHf Vm}e, 5 := E W^r 

Furthermore, let g be the function with Fourier coefficients \g]j := [g]j. Given 1 ^ k ^ k' 
we have then for all t E Sj. := span{e_fc, . . . , e^} 

k k 



i=l j=—k 

(tjk'h = (t,%) u = l n Y, E ^(-^)^1{IH-| 2 > l/m} = <*,&>„. 

i=l j=~k VP\j 

Consider the function v = g — g with Fourier coefficients [u]j = [g\- — [g]j = [g\- — E[g] •, 
then we have for every t € Sk, 

(t, - f)u = (t, $g - $g)ul = (t, $9 ~ $g)uj + (t, $g ~ ®g)u 

= (t, §v)u + (t, 8 ? - &3) u = (t, + (t, 8„ - + (t, $ 9 - $ g ) u . (A.7) 

At the end of this section we will prove three technical Lemmata (A.2, A.3 and A.4) which are used in the following proof.

Proof of Theorem 3.3. We consider the contrast

$$\Upsilon(t):=\|t\|_\omega^2-2\langle t,\hat\Phi_{\hat g}\rangle_\omega,\qquad\forall t\in L^2[0,1].$$

Obviously it follows for all $t\in S_k$ that $\Upsilon(t)=\|t-\hat f_k\|_\omega^2-\|\hat f_k\|_\omega^2$, and hence

$$\operatorname{argmin}_{t\in S_k}\Upsilon(t)=\hat f_k,\qquad\forall k\ge1.\tag{A.8}$$

Moreover, the adaptive choice $\hat k$ of the dimension parameter can be rewritten as

$$\hat k=\operatorname{argmin}_{1\le k\le(N_n\wedge M_m)}\Big\{\Upsilon(\hat f_k)+60\,d\,\frac{\delta_k}{n}\Big\}.\tag{A.9}$$

Let $\operatorname{pen}(k):=60\,d\,\delta_k/n$, then for all $1\le k\le(N_n\wedge M_m)$ we have

$$\Upsilon(\hat f_{\hat k})+\operatorname{pen}(\hat k)\le\Upsilon(\hat f_k)+\operatorname{pen}(k)\le\Upsilon(f_k)+\operatorname{pen}(k),$$

using first (A.9) and then (A.8). This inequality implies

WfrWl - WfkWl < 2(f% - fk, h)" + P en W - P en (fc)> 

and hence, using (A. 7), we have for all 1 ^ k ^ (N n A M m ) 

114 ~ fill < 11/ " /fell' + Pen(^) - pen(fc) 

+ 2(4 - / fc , i„) u + 2(4 - A, $„ - + 2(4 - / fcl $ 5 - * fl ) w . (A.10) 

Consider the unit ball := {f £ ■ \\f\\u ^ 1} and, for arbitrary r > and t G Sfc, the 
elementary inequality 

fe 

2|(t,/i) iJ | < 2||t|| w sup|(t J /i) w | <r||t||2 +-sup|(t,/ l ) w | 2 = r||t||2 +1 V c^|[%| 2 . 

±^v-> T i .- to T L ■ 



j=-k 



Combining the last estimate with (A. 10) and 4 — G 5^ vfc C <Sjv„AM m we obtain 

114 - /II' < 11/ - Ml' + 3r H/fe - Ml" + P en W - P en W 

+ - SUp \(t,$ u ) u \ 2 + - SUp | $„>w | 2 + - SUp \(t,$ g - $g)u\ 2 - 



t tsB, 



(N n AM m ) 



Decompose \{t,$ v - $ v ) u \ 2 = \ {t, $ u - $„) u \ 2 l{Sl q } + \{t, $„ - ^) W | 2 1{^} further using 



:=^V0<|j| ^M„ 



1 1 



[p], [<p]j %\[<p]j 



A |M,.| 2 ^ 1/m 



(A.11) 



Since 1{|MJ 2 ^ l/m}l{fig} = it follows that for all 1 sC \j\ ^ (N n A M m ) we have 



1{|M ? | 2 > 1/m} - 1 = 1{0,} 



1 1 



<: 



Hence, su PteBfc |<t,8„ - ^) w | 2 < | su Pt6Bfe |(t,^) w | 2 for all 1 ^ k ^ (iV n A M„ 

Letting r := 1/8 it follows from \\% - / fc || 2 < 2\\fc - /|| 2 + 2||/ fc - /|| 2 that 



Wfk-fWl^^Wf-fkWl + m sup |(t,*„) w | 2 -(6dff fcvie J/n 

\teB kvk 

+ ( 60 dS, z)/n + pen(fc) - pen(£?) 



+ 8 sup |(t,8^-^) w | 2 l{^} + 8 sup \(t,$ g -$ 



(JV n AM m ) 



(iV n AM m ) 
Since $\omega/\gamma$ is non-increasing we obtain $\|f-f_k\|_\omega^2\le r\,\omega_k/\gamma_k$ for all $f\in\mathcal F_\gamma^r$. Furthermore, notice that $60\,d\,\delta_{k\vee\hat k}/n=\operatorname{pen}(k\vee\hat k)\le\operatorname{pen}(k)+\operatorname{pen}(\hat k)$. By taking the expectation on both sides we conclude that there exists a numerical constant $C>0$ such that

sup sup E\\f % - f\\l <C(d + r) min (maxf^,^} 



l<fc<(iV n AM m ) <- 



Ik n 



+ C sup sup EC ( SU P 

f^^nd 1<lkl[<iNnAMm) \teB k , 



n 



+ C sup sup IE 



teB, 



sup Kt^-^y 2 !^} 



(iV n AM m ) 



+ C sup sup IE 



tee. 



sup |(i,8 s - $ 9 )J 2 



(]V n AM m ) 



In order to bound the second term, apply Lemma A.2 with $\delta_k^*=d\,\delta_k$ and $\Delta_k^*=d\,\Delta_k$. Due to the properties of $N_n$ and of the function $\Sigma$ from Definition 3.1, there is a numerical constant $C>0$ such that



5>( S up|(t,^| 2 -6^ 



It is readily verified that $\|\varphi\|^2\le d\Lambda$ for all $\varphi\in\mathcal E_\lambda^d$ and $\|f\|^2\le r$ for all $f\in\mathcal F_\gamma^r$. The result follows now by virtue of Lemma A.3, A.4, A.5, and Definition 3.1 (i). □



In the proof of Lemma A.2 below we will need the following lemma, which can be found in Comte et al. (2006).

Lemma A.1 (Talagrand's Inequality) Let $T_1,\dots,T_n$ be independent random variables and $\nu_n^*(r)=(1/n)\sum_{i=1}^n\big[r(T_i)-\mathbb E[r(T_i)]\big]$, for $r$ belonging to a countable class $\mathcal R$ of measurable functions. Then,

$$\mathbb E\Big[\sup_{r\in\mathcal R}|\nu_n^*(r)|^2-6H_2^2\Big]_+\le C\Big(\frac vn\exp\big(-(nH_2^2/6v)\big)+\frac{H_1^2}{n^2}\exp\big(-K_2(nH_2/H_1)\big)\Big)$$

with numerical constants $K_2=(\sqrt2-1)/(21\sqrt2)$ and $C$ and where

$$\sup_{r\in\mathcal R}\|r\|_\infty\le H_1,\qquad\mathbb E\Big[\sup_{r\in\mathcal R}|\nu_n^*(r)|\Big]\le H_2,\qquad\sup_{r\in\mathcal R}\frac1n\sum_{i=1}^n\operatorname{Var}(r(T_i))\le v.$$



Lemma A. 2 Let (S k )k£Z and (A|)fc 6 2 be sequences such that 



UJj 



and 



0<\j\<k 



At ^ max 

ocliia 



and Zei K 2 := (\/2 - l)/(21 v / 2). T^en, t/iere is a numerical constant C > suc/i i/iai 



fc=l 



^ c 



Ve[( sup \{t,$ v ) u \ 



6 51 



t£B k 

2 II f\\2 N n 



11 



fc=i 



fc CX P 



6 \\ip 



1 \ 1 Nn ] 

ra p(^/ADj+ n 2 exp(-if 2V SE^] 
Proof. For $t\in S_k$ define the function $r_t:=\sum_{0\le|j|\le k}\omega_j[t]_j[\varphi]_j^{-1}e_j$, then it is readily seen that $\langle t,\Phi_\nu\rangle_\omega=\frac1n\sum_{i=1}^n\big(r_t(Y_i)-\mathbb E[r_t(Y_i)]\big)$. Next, we compute constants $H_1$, $H_2$, and $v$ verifying the three inequalities required in Lemma A.1, which then implies the result. Consider $H_1$ first:



fuplNlL = sup UjMjl 2 \ e j(v)\ 2 = Y Uj\[<f]j\ 2 



teB k 



Next, find Ho. Notice that 



E[sup \{t^ v ) u \ 2 } = - Yj w ilH-r 2 Var( ej (Yi)). 
* eSfe n 0<|j|<* 

As Var( ej (Yi)) < E[| ej -(Yi) | 2 ] = 1, we define E[sup teBfc \(t,$„)\ 2 } < 5* k /n =: flf . 
Finally, consider v. Given t E B k and a sequence (zj)j & z let [i] := ([£]_&, • • • , [t]k)~ 
note by D k (z) := diag[z_fc, . . . , z k ] the corresponding diagonal matrix. Define the I 

and positive semi-definite matrix A k := i[f>]j M^ 1 W\j-j' [f]j-j' ) , , , • Straightfor- 



j,j'=-k,...,k 

ward algebra shows sup t&Bfe Yav(r t (Yi)) ^ sup teB {A k D k (uj) [t], D k (ui)[t]) C 2k+i , hence 



1 n 

sup - V¥ar(r t (7 fc )) < sup<Aj /2 D fc (u;)[t],4 /2 D fc (a;)[t]) c2fc+ i 

sup||4 /2 ^fc(^)Mllc^+i = Pfe(v^)^ffc(v^)llc^+i 



k=i * eB * 



teB 



k 



Clearly, we have A k = D k ([ip] x ) B k D k ([ip] ), where B k := ([ip}j- k [f]j-k) j tk= _ ki ... )fc - 
Consequently, 

1 n 

sup - V, Var(r t (Y fc )) ^ ((^(v 7 ^ M _1 )Hc2fc+i ||-B fc || c2 fe+i. 

We have that \\D k {y/uj [<p]~~ 1 )\\1<2k+i = max o^|j|^fc w jlMjl~ 2 ^ ^ k - It remains to show the 
boundedness of ||-Bfc||c2fc+i- Let £ 2 be the space of square-summable sequences in C and 
define the operator B : £ 2 — > £ 2 by (Bz) k := SjezMj-fc [f]j-kZj, k G Z. Then it is 
easily verified that for any z 6 ^ 2 with \\z\\gi = 1, the Cauchy-Schwarz inequality yields 
||-B-z|| 2 2 ^ IMI 2 ll/ll 2 ; an d hence ||-B|| 2 2 ^ IM| 2 ||/|| 2 - Given the orthogonal projection IT^ 
in £ 2 onto S k the operator 11^511^ : 5/% — > S k has the matrix representation B k via the 
isomorphism S k = C 2k+1 and hence SIljt||^2 = ||Sfc||cafe+i- Orthogonal projections 
having a norm bounded by 1, we conclude that ||-Bfc||c 2fe + 1 ^ II ^\\e 2 f° r a U k £ N, which 
implies sup tgBfc - Ylk=i ^ aT ( r t(Y k )) ^ IM| 2 l|/l| 2 A£ =: v an d completes the proof. □ 

Lemma A. 3 There is a numerical constant C > such that for every k, m G N 



E 



SUP \{t,<f>g ~ 3> g )J 2 

t£B k 



< C dr K m (7, X,u). 



Proof. Firstly, as / G J 7 ^, it is easily seen that 



E 



sup I (t, $ 9 - $ 



|2 
g/w\ 



sup ^-E^l 2 ] 
where $R_j$ is defined by

$$R_j:=\frac{[\varphi]_j}{[\hat\varphi]_j}\,1\{|[\hat\varphi]_j|^2\ge1/m\}-1.\tag{A.12}$$



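In view of the definition (2.2) of $\kappa_m$, the result follows from $\mathbb E[|R_j|^2]\le C\min\big\{1,\frac1{m|[\varphi]_j|^2}\big\}$, which can be realized as follows. Consider the identity

$$\mathbb E|R_j|^2=\mathbb E\Big[\frac{|[\hat\varphi]_j-[\varphi]_j|^2}{|[\hat\varphi]_j|^2}\,1\{|[\hat\varphi]_j|^2\ge1/m\}\Big]+\mathbb P\big[|[\hat\varphi]_j|^2<1/m\big]=:R_j^{I}+R_j^{II}.\tag{A.13}$$

Trivially, $R_j^{II}\le1$. If $1\le4/(m|[\varphi]_j|^2)$, then obviously $R_j^{II}\le4\min\big\{1,\frac1{m|[\varphi]_j|^2}\big\}$. Otherwise, we have $1/m<|[\varphi]_j|^2/4$ and hence, using Chebyshev's inequality,

$$R_j^{II}\le\mathbb P\big[|[\hat\varphi]_j-[\varphi]_j|>|[\varphi]_j|/2\big]\le\frac{4\operatorname{Var}([\hat\varphi]_j)}{|[\varphi]_j|^2}\le4\min\Big\{1,\frac1{m|[\varphi]_j|^2}\Big\},$$

where we have used that $\operatorname{Var}([\hat\varphi]_j)\le1/m$ for all $j$. Now consider $R_j^{I}$. We find that

$$R_j^{I}=\mathbb E\Big[\frac{|[\hat\varphi]_j-[\varphi]_j|^2}{|[\hat\varphi]_j|^2}\,1\{|[\hat\varphi]_j|^2\ge1/m\}\Big]\le m\operatorname{Var}([\hat\varphi]_j)\le1.\tag{A.14}$$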



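On the other hand, using that $\mathbb E[|[\hat\varphi]_j-[\varphi]_j|^4]\le c/m^2$ for some numerical constant $c>0$ (cf. Petrov (1995), Theorem 2.10), we obtain

$$R_j^{I}\le\mathbb E\Big[\frac{1\{|[\hat\varphi]_j|^2\ge1/m\}}{|[\hat\varphi]_j|^2}\,2\Big\{\frac{|[\hat\varphi]_j-[\varphi]_j|^4}{|[\varphi]_j|^2}+\frac{|[\hat\varphi]_j|^2\,|[\hat\varphi]_j-[\varphi]_j|^2}{|[\varphi]_j|^2}\Big\}\Big]$$

$$\le\frac{2m\,\mathbb E[|[\hat\varphi]_j-[\varphi]_j|^4]}{|[\varphi]_j|^2}+\frac{2\operatorname{Var}([\hat\varphi]_j)}{|[\varphi]_j|^2}\le\frac{2c}{m|[\varphi]_j|^2}+\frac2{m|[\varphi]_j|^2}.$$

Combining with (A.14) gives $R_j^{I}\le2(c+1)\min\big\{1,\frac1{m|[\varphi]_j|^2}\big\}$, which completes the proof. □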
Lemma A. 4 There is a numerical constant C > suc/t 



E 



tee. 



sup |(t,$,-^) w l{0^}| 2 



(iV„AM m ) 



^ CcMiCPtns])^. 



Proof. Given with Rj from (A.12) we begin our proof observing that 



E 



sup |(t,^-$,) w l{^}| 2 



£ 



o<|j|<(iV„AJW m ) nriJ 



and using the independence of the two samples and Var([g] •) ^ n 1 . Since dJfc ^ J] 
for all (p £ £f, the Cauchy-Schwarz inequality yields 



0<\j\^k |[^-|2 



E 



sup |(t,^-^) w l{^}| 2 



< d(p[ns])V^ ffi (EH*/]) 1 /*. 

n o<|j|<JV„ 
Proceeding analogously to (A.13) and (A.14), there exists a numerical constant $C$ such that $\mathbb E[|R_j|^4]\le C$. The result follows then by Definition 3.1 (ii). □



Lemma A.5 Consider the event $\Omega_q$ defined in (A.11). We have $\mathbb P[\Omega_q^c]\le4(504\,d/\lambda_1)^7m^{-6}$ for all $m\ge1$.

Proof. Consider the complement of Q q given by 



3 <\j\ sC M„ 



i 



> 5 V |M/ < l/m 



It follows from Assumption 3.2 that |[</?]j| 2 #5 2/m for all < |j| ^ M m . This yields 



fiJC 30 <|j| < M m : 



By Hoeffding's inequality, 

P[|0,/M J -l|>V3]^2exp 



> 



m 



72 



which implies the result by employing the definition of M m . 



(A.15) 
□ 



Illustration: estimation of derivatives 

Proof of Proposition 3.5. In the light of the proof of Proposition 2.6 we apply Theorem 3.3, 
where in both cases the additional conditions are easily verified and the result follows by 
an evaluation of the upper bound. 

[os] Let $k_n^*:=n^{1/(2a+2p+1)}$ and note that $k_n^*\lesssim N_n$. Thus, the upper bound is

$$(k_n^*\wedge M_{m_n})^{-2(p-s)}+m_n^{-(1\wedge(p-s)/a)}.\tag{A.16}$$

We consider two cases. First, let $p-s>a$. Suppose that $n^{2(p-s)/(2p+2a+1)}=O(m_n)$. Then,

fe * n l/(2a+2 P +l) n l/(2a+2 P +l) ^(p-) (]og ^ l/2a 



log 



0(1). 



This means that $k_n^*\lesssim M_{m_n}$, so the resulting upper bound is $(k_n^*)^{-2(p-s)}+m_n^{-1}\lesssim(k_n^*)^{-2(p-s)}$. Suppose now that $m_n=o(n^{2(p-s)/(2p+2a+1)})$. If in addition $k_n^*=O(M_{m_n})$, then the first summand in (A.16) reduces to $(k_n^*)^{-2(p-s)}$ and hence the upper bound is $m_n^{-1}$. On the other hand, if $M_{m_n}/k_n^*=o(1)$, then the first term is $(M_{m_n})^{-2(p-s)}\lesssim M_{m_n}^{-2a}(\log m_n)^{-1}=m_n^{-1}$, since $p-s>a$. Combining both cases, we obtain the result in case $p-s>a$.

Now assume $p-s\le a$. First, suppose that $k_n^*=O(M_{m_n})$. Then the first summand in (A.16) reduces to $(k_n^*)^{-2(p-s)}$ and moreover $n^{2a/(2p+2a+1)}=O(m_n)$. Therefore, the upper bound is $(k_n^*)^{-2(p-s)}$. Consider now $M_{m_n}=o(k_n^*)$. Then (A.16) can be rewritten as $(m_n/\log m_n)^{-(p-s)/a}+m_n^{-(p-s)/a}$, which results in the rate $(m_n/\log m_n)^{-(p-s)/a}$. Combining both cases gives the result. More precisely, $m_n=o(n^{2a/(2p+2a+1)})$ implies $M_{m_n}=o(k_n^*)$. On the other hand, in case $n^{2a/(2p+2a+1)}=O(m_n)$, if $k_n^*/M_{m_n}=O(1)$, then the rate is $(k_n^*)^{-2(p-s)}$, while if $M_{m_n}/k_n^*=o(1)$, we have the rate $(m_n/\log m_n)^{-(p-s)/a}$.
[ss] Choose $k_n^*\sim(\log n)^{1/2a}(1+o(1))$, and note that $N_n\sim(\log n)^{1/2a}(1+o(1))$ and $M_{m_n}\sim(\log m_n)^{1/2a}(1+o(1))$. The upper risk bound is now $(k_n^*\wedge M_{m_n})^{-2p}+(\log m_n)^{-p/a}$. Consider two cases. Firstly, $\log n/\log m_n=O(1)$. This implies $N_n/M_{m_n}=O(1)$ and hence $k_n^*/M_{m_n}=O(1)$. This means that the upper bound is in fact $(k_n^*)^{-2p}+(\log m_n)^{-p/a}\sim(\log n)^{-p/a}$. In the case $\log m_n/\log n=o(1)$, an analogous argument proves the claim, which completes the proof. □



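A.3. Proofs of section 4

Proof of Theorem 4.3. Define $\Delta_k^\varphi:=\max_{0\le|j|\le k}\omega_j/|[\varphi]_j|^2$, $\tau_k^\varphi:=\max_{0\le|j|\le k}(\omega_j)_{\vee1}/|[\varphi]_j|^2$, and $\delta_k^\varphi:=2k\,\Delta_k^\varphi\{\log(\tau_k^\varphi\vee(k+2))/\log(k+2)\}$. Then, it is easily seen that

$$\delta_k^\varphi\le\delta_k\,d\,\frac{\log(3d)}{\log3}=\delta_k\,d\,\zeta_d\qquad\forall k\ge1,\tag{A.17}$$

with $\zeta_d=(\log(3d))/(\log3)$. Moreover, define the event $\Omega_{qp}:=\Omega_q\cap\Omega_p$ where $\Omega_q$ is given in (A.11) and

$$\Omega_p:=\big\{(N_n^l\wedge M_m^l)\le(\hat N_n\wedge\hat M_m)\le(N_n\wedge M_m)\big\}.\tag{A.18}$$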

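Observe that on $\Omega_q$ we have $(1/2)\Delta_k^\varphi\le\hat\Delta_k\le(3/2)\Delta_k^\varphi$ for all $1\le k\le M_m$ and hence $(1/2)[\tau_k^\varphi\vee(k+2)]\le[\hat\tau_k\vee(k+2)]\le(3/2)[\tau_k^\varphi\vee(k+2)]$, which implies

$$(1/2)\,2k\Delta_k^\varphi\,\frac{\log[\tau_k^\varphi\vee(k+2)]}{\log(k+2)}\Big(1-\frac{\log2}{\log(k+2)}\,\frac{\log(k+2)}{\log(\tau_k^\varphi\vee[k+2])}\Big)\le\hat\delta_k$$

$$\hat\delta_k\le(3/2)\,2k\Delta_k^\varphi\,\frac{\log(\tau_k^\varphi\vee[k+2])}{\log(k+2)}\Big(1+\frac{\log3/2}{\log(k+2)}\,\frac{\log(k+2)}{\log(\tau_k^\varphi\vee[k+2])}\Big).$$

Using $\log(\tau_k^\varphi\vee(k+2))/\log(k+2)\ge1$, we conclude from the last estimate that

$$\delta_k^\varphi/10\le\frac{\log3/2}{2\log3}\,\delta_k^\varphi\le(1/2)\,\delta_k^\varphi\big[1-(\log2)/\log(k+2)\big]\le\hat\delta_k\le(3/2)\,\delta_k^\varphi\big[1+(\log3/2)/\log(k+2)\big]\le3\,\delta_k^\varphi.$$

Letting $\operatorname{pen}(k):=60\,\delta_k^\varphi n^{-1}$ and $\widehat{\operatorname{pen}}(k):=600\,\hat\delta_kn^{-1}$, it follows that on $\Omega_q$

$$\operatorname{pen}(k)\le\widehat{\operatorname{pen}}(k)\le30\operatorname{pen}(k)\qquad\forall\,1\le k\le M_m.$$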
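The numerical constants in the last chain can be verified directly. The sketch below checks, for a range of $k\ge1$, that the bracket $(1/2)[1-\log2/\log(k+2)]$ is at least $\log(3/2)/(2\log3)\ge1/10$ and that $(3/2)[1+\log(3/2)/\log(k+2)]\le3$, which is exactly what turns the factor 600 in $\widehat{\operatorname{pen}}$ into the sandwich between $\operatorname{pen}(k)$ and $30\operatorname{pen}(k)$. It is a plain arithmetic check; the names `LOWER_CONST` and `bracket` are hypothetical.

```python
import math

# log(3/2) / (2 log 3), the worst-case lower factor (attained at k = 1)
LOWER_CONST = math.log(1.5) / (2.0 * math.log(3.0))


def bracket(k):
    """Lower and upper multiplicative factors relating delta_hat_k to delta_k^phi."""
    lower = 0.5 * (1.0 - math.log(2.0) / math.log(k + 2))
    upper = 1.5 * (1.0 + math.log(1.5) / math.log(k + 2))
    return lower, upper


# Worst cases over a range of dimensions k >= 1.
worst_lower = min(bracket(k)[0] for k in range(1, 10001))
worst_upper = max(bracket(k)[1] for k in range(1, 10001))
```

With these factors, $600\cdot\delta_k^\varphi/10=60\,\delta_k^\varphi$ and $600\cdot3\,\delta_k^\varphi=30\cdot60\,\delta_k^\varphi$, matching the stated penalty comparison.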
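On $\Omega_{qp}=\Omega_q\cap\Omega_p$, we have $\hat k\le M_m$. Thus,

$$\big(\operatorname{pen}(\hat k\vee k)+\widehat{\operatorname{pen}}(k)-\widehat{\operatorname{pen}}(\hat k)\big)1\{\Omega_{qp}\}\le\big(\operatorname{pen}(k)+\operatorname{pen}(\hat k)+\widehat{\operatorname{pen}}(k)-\widehat{\operatorname{pen}}(\hat k)\big)1\{\Omega_{qp}\}$$

$$\le31\operatorname{pen}(k)\qquad\forall\,1\le k\le M_m.\tag{A.19}$$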

Furthermore, we have A^ ^ A^m for every k 1, which implies <5fc ^ m (1 + logm)^. 
Consequently, peh(A;) $C 10 m (1+logm) pen(/c) ^ 600 m {l+logm^dQd Si for all 1 ^ ^ 2V„ 
by employing (A.17) and the definition of N n . Therefore, on n Cl p , where k ^ N n , we 

have pen(/c V k) ^ 60 for all 1 ^ ^ N n , and hence 

(pen(fc Vfc) +pen(A;) -peh(k))l{n c n n p } < 60dCd5i(l + 10 m (1 + logm))l{^ n n p }. 
(A.20)

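Now consider the decomposition

$$\mathbb E\|\hat f_{\hat k}-f\|_\omega^2=\mathbb E\|\hat f_{\hat k}-f\|_\omega^2\,1\{\Omega_{qp}\}+\mathbb E\|\hat f_{\hat k}-f\|_\omega^2\,1\{\Omega_q^c\cap\Omega_p\}+\mathbb E\|\hat f_{\hat k}-f\|_\omega^2\,1\{\Omega_p^c\}.$$

Below we show that there exists a numerical constant $C$ such that for all $n,m\ge1$ and all $1\le k\le N_n^l\wedge M_m^l$ we have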

E||4 - /|| 2 < cUf - f k \\l + d& ^ + r 

^ft±wmi (A . 21) 

n 



n\h - ntmi n ^} < c| 11/ - +rrf K . 



£ 1 + E(Cd|M| 2 imi 2 ) | Sx{dp% 



n m 



(A.22) 

nh - f\\m%} < c (f ) 7 (a.23) 

The desired upper bound follows for every $1\le k\le(N_n^l\wedge M_m^l)$ by virtue of Definition 4.1 and Assumption 4.2.

Proof of (A. 21). Following the proof in case of known degree of ill-posedness (Section A. 2) 
line by line, it is easily seen that for 1 $C k ^ (N l n A M l m ), 

(1/2)||4 " fWin^p} < (3/2)11/ - ml + 10 £ ( ^p |<t, <^>.| 2 - 6^ 

+ 8 sup - $ 9 ) w | 2 + rpen(fcVfc) + pen(fc)-pen(fc))l{ngp} 

< (3/2)||/ - A|| 2 + 10 V ( sup \(t, $ v ) u \ 2 - $-\ 

+ 8 sup \{t,$ g -$ 9 )<J 2 + 31pen(/c), 

where the last inequality follows from (A. 19). The third term is bounded by employing 
Lemma A. 3. In order to control the second term, apply Lemma A. 2 with <5£ = 5^ and 
A* = A£. Using (A.17), A£ < dr k , Cdlog« V (k + 2)) > log(r fc V (k + 2)) and the 
definition of S, we conclude with Assumption 3.2 that there exists a numerical constant 
C > such that 

sup \(t, ^U 2 -ft 6 *) < + £(|M| 2 ||/|| 2 C,)}. (A.24) 

Consequently, combining these estimates proves inequality (A. 21). 
Proof of (A.22). On $\Omega_q^c\cap\Omega_p$, we have $N_n^l\wedge M_m^l\le\hat N_n\wedge\hat M_m\le N_n\wedge M_m$. Applying (A.20), it follows in analogy to the proof of Theorem 3.3 that for all $1\le k\le N_n^l\wedge M_m^l$



(1/2)11^ - f\\lHn c q n n P } < (3/2)||/ - f k \\l + 10 ^ sup |(t,^)J 2 - 6^ 



+ 8 sup |(t,$„-$„) (( ,l{fi£}| 2 + 8 sup |(t,$ ff -$ s 



+ (pen(fc V fc) + peh(A;) - pen(/c)J n ft p } 

< (3/2)||/ - / fc || 2 + 10 V sup \(t, <S> V ) U \ 2 - 6^ 

+ 8 sup \{t,$ v -$ v ) u l{S%}\ 2 + 8 sup |(t,8 9 -$ 9 ) w | 2 

+ 60(iCd5i(l + 10m(l + logm))l{O^nO p }. 
Due to Lemmas A.3 and A.4 and (A.24), there exists a numerical constant $C$ such that



,l{^nO p } < chf-fkWl + dTKr. 



+ d( d 



Si+^mplWl + ^(P^])^) + 5l m (i + togm)P[flg] 

n ' 



Employing Lemma A.5 now proves (A.22).

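Proof of (A.23). Let $\breve f_k := 1 + \sum_{0<|j|\le k} [f]_j\,\mathbf{1}\{|[\hat\varphi]_j|^2 \ge 1/m\}\, e_j$. It is easy to see that
$\|\hat f_k - \breve f_k\|_\omega^2 \le \|\hat f_{k'} - \breve f_{k'}\|_\omega^2$ for all $k \le k'$ and $\|\breve f_k - f\|_\omega^2 \le \|f\|_\omega^2$ for all $k \ge 1$. Thus, using
that $1 \le \hat k \le (N_n^u \wedge m)$, we can write

\[
\mathbb{E}\|\hat f_{\hat k} - f\|_\omega^2\,\mathbf{1}\{\Omega_p^c\}
\le 2\bigl\{\mathbb{E}\|\hat f_{\hat k} - \breve f_{\hat k}\|_\omega^2\,\mathbf{1}\{\Omega_p^c\} + \mathbb{E}\|\breve f_{\hat k} - f\|_\omega^2\,\mathbf{1}\{\Omega_p^c\}\bigr\}
\le 2\bigl\{\mathbb{E}\|\hat f_{(N_n^u \wedge m)} - \breve f_{(N_n^u \wedge m)}\|_\omega^2\,\mathbf{1}\{\Omega_p^c\} + \|f\|_\omega^2\, P[\Omega_p^c]\bigr\}.
\]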



Moreover, applying Theorem 2.10 in Petrov (1995) we conclude 

E||/(7V«Am) _ /(Af^Am)ll 2 l{^p} 



0<\j\4(N%Am) 

4i 1/2 



^2m{ £ Wi [E ([<?],■ - b]i) ] P[^] 1/2 

0<|jK(AT«Am) 

£ ^|[/],f[E(M,-H-) 4 ] 1/2 P[^] 1/2 } 



+ 

0<bK(iV«Aro) 



^ 2m| (2m max^Wj) (cn + ^ll/ll 2 }?^] 172 , 



which implies, using Definition 4.1 (ii), 



n\T% - /ii 2 w < ^{ (™ 2 + ii/ii 2 ) p[^i 1/2 + wni p[^i}. 
Lemma A.6 below together with Definition 3.1 (ii) yields, for some numerical constant $C > 0$,



• ' m mP mP 

which completes the proof. □ 

Lemma A.6 Consider the event $\Omega_p$ defined in (A.18). Then we have
\[
P(\Omega_p^c) \le 6\,(504\, d/\lambda_1)^7\, m^{-6} \qquad \forall\, n, m \ge 1.
\]

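Proof. Let $\Omega_I := \{(N_n^l \wedge M_m^l) > (\hat N_n \wedge \hat M_m)\}$ and $\Omega_{II} := \{(\hat N_n \wedge \hat M_m) > (N_n \wedge M_m)\}$. Then
we have $\Omega_p^c = \Omega_I \cup \Omega_{II}$. Consider $\Omega_I = \{\hat N_n < (N_n^l \wedge M_m^l)\} \cup \{\hat M_m < (N_n^l \wedge M_m^l)\}$ first. By
definition of $N_n^l$, we have that $\min_{1 \le |j| \le N_n^l} \frac{|[\varphi]_j|^2}{|j|(\omega_j)_{\vee 1}} \ge \frac{4(\log n)}{n}$, which implies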

{A\ < (AT< AM^)} C jai SC |j| < (N l n AM l J : < ^ 

c U {S° /2 } c U 11 > 1/2}. 

One can see that from min ls g| J |^ M i |[y]j| 2 ^ 4 ^ lo ^ m ^ it follows in the same way that 



: n 



M m < {N l n A M l m ) \ C |J I IM./bb - 1| > 1/2 |. 

Therefore, $\Omega_I \subset \bigcup_{1 \le |j| \le M_m} \{|[\hat\varphi]_j/[\varphi]_j - 1| > 1/2\}$, since $M_m^l \le M_m$. Hence, as in (A.15),
applying Hoeffding's inequality together with the definition of $M_m$ gives

P[0j] < ^ 2 exp (- m| ^ j H ^ 4(504 d/Ai) 7 m- 6 . (A.25) 

Consider O// = {iV n > (N n A M m )} n {M m > (A r „ A M m )}. In case (AT n A M m ) = N n , use 
^ max|j|j,jv n |j|(^] J j vl due to Assumption 4.2, such that 

n n c{N n >N n }c{vi^\ j \^N n : J f^> 1 ^} 
I \j\{Uj)vi n j 

In case (N n A M m ) = M m , it follows analogously from 5? maxiji^^ |[v?]j| 2 that 

C {M m > M m } C {|M Mm /MM m - 1| > l}. 






Therefore, $\Omega_{II} \subset \{|[\hat\varphi]_{N_n \wedge M_m}/[\varphi]_{N_n \wedge M_m} - 1| > 1\}$ and hence, as in (A.15), applying
Hoeffding's inequality together with the definition of $M_m$ gives

P[n n ] < 2 exp f - m|M ^ AMj2 ) < 2(504 d/A!) 7 m- 7 . (A.26) 

Combining (A.25) and (A.26), and using $m^{-7} \le m^{-6}$ for $m \ge 1$, implies the result. □
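The Hoeffding step behind (A.25) and (A.26) can be checked numerically. The following is a minimal sketch and not the paper's computation. It assumes a wrapped normal error on [0, 1), whose Fourier coefficients are exp(-2 pi^2 j^2 sigma^2), the convention that the empirical coefficient is the sample mean of exp(-2 pi i j eps_i), and a generic Hoeffding bound 4 exp(-m t^2 / 4) obtained by treating real and imaginary parts separately. The function names and the constants are illustrative and differ from those entering the definition of $M_m$.

```python
import cmath
import math
import random


def empirical_coefficient(sample, freq):
    """Sample mean of exp(-2*pi*i*freq*eps) over the error sample."""
    return sum(cmath.exp(-2j * math.pi * freq * eps) for eps in sample) / len(sample)


def hoeffding_bound(m, phi_abs, rel=0.5):
    """Generic bound on P(|phi_hat - phi| > rel * |phi|) for m observations.

    Real and imaginary parts lie in [-1, 1]; applying Hoeffding's inequality
    to each with threshold t / sqrt(2) gives 4 * exp(-m * t**2 / 4).
    """
    t = rel * phi_abs
    return min(1.0, 4.0 * math.exp(-m * t * t / 4.0))


def deviation_frequency(m, freq, sigma, reps, rng, rel=0.5):
    """Monte Carlo frequency of the event |phi_hat / phi - 1| > rel.

    Assumption: wrapped normal errors, so phi = exp(-2 * pi^2 * freq^2 * sigma^2).
    """
    phi = math.exp(-2.0 * math.pi ** 2 * freq ** 2 * sigma ** 2)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(0.0, sigma) % 1.0 for _ in range(m)]
        phi_hat = empirical_coefficient(sample, freq)
        if abs(phi_hat / phi - 1.0) > rel:
            hits += 1
    return hits / reps, phi


if __name__ == "__main__":
    rng = random.Random(0)
    observed, phi = deviation_frequency(m=500, freq=2, sigma=0.1, reps=100, rng=rng)
    print("observed frequency:", observed)
    print("Hoeffding-type bound:", hoeffding_bound(500, phi))
```

The bound decays exponentially in $m\,|[\varphi]_j|^2$, which is why restricting the frequencies to those with sufficiently large coefficients keeps the union bound over $j$ polynomially small in $m$.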



Illustration: estimation of derivatives 

Proof of Proposition 4.5. We start our proof with the observation that in both cases the
sequences $\delta$, $\Delta$, $N$ and $M$ are the same as in Proposition 3.5, and it is easily verified
that the additional Assumption 4.2 is satisfied. Moreover, in case [os] we have $N_n^l \sim (n/(\log n))^{1/(2a+2s+1)}$
and $M_m^l \sim (m/(\log m)^2)^{1/(2a)}$. Let $k^* := n^{1/(2a+2p+1)}$ and note
that still $k^* \lesssim N_n^l$. In case [ss] we have $N_n^l \sim \{\log(n/(\log n)^{(2p+2a+1)/(2a)})\}^{1/(2a)} = (\log n)^{1/(2a)}(1 + o(1))$
and $M_m^l \sim \{\log(m/(\log m)^3)\}^{1/(2a)} = (\log m)^{1/(2a)}(1 + o(1))$. The
rest of the proof in both cases is almost identical to that of Proposition 3.5 but uses $N_n^l$
and $M_m^l$ rather than $N_n$ and $M_m$, and we omit the details. □
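To get a feeling for the orders quoted in this proof, they can be evaluated directly. The following is a rough numerical sketch: the relation "~" hides constants, which are all set to one here as an assumption, and the function names are hypothetical. It evaluates $k^*$ and the order of $M_m^l$ in case [os], and shows how slowly the $(1 + o(1))$ factor of $M_m^l$ in case [ss] approaches one.

```python
import math


def k_star_os(n, a, p):
    """k* := n^(1/(2a+2p+1)) from case [os]."""
    return n ** (1.0 / (2 * a + 2 * p + 1))


def m_lower_os(m, a):
    """Order of M_m^l in case [os]: (m / (log m)^2)^(1/(2a)), constants set to one."""
    return (m / math.log(m) ** 2) ** (1.0 / (2 * a))


def m_lower_ss(m, a):
    """Order of M_m^l in case [ss]: {log(m / (log m)^3)}^(1/(2a)).

    Only meaningful when m > (log m)^3, so that the outer logarithm is positive.
    """
    return math.log(m / math.log(m) ** 3) ** (1.0 / (2 * a))


def ss_ratio(m, a):
    """Ratio of the [ss] order to its leading term (log m)^(1/(2a))."""
    return m_lower_ss(m, a) / math.log(m) ** (1.0 / (2 * a))


if __name__ == "__main__":
    print("k* for n = 10^6, a = 1, p = 2:", round(k_star_os(10 ** 6, 1.0, 2.0), 3))
    for m in (10 ** 3, 10 ** 6, 10 ** 12):
        print(m, round(m_lower_os(m, 1.0), 2), round(ss_ratio(m, 1.0), 3))
```

The ratio returned by `ss_ratio` increases towards one as $m$ grows, but only at a logarithmic speed, so the leading term $(\log m)^{1/(2a)}$ is an asymptotic description rather than a sharp finite-sample value.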